Freedom of Information Request Reveals Facial Recognition Error Rate
South Wales Police Trial Generated a “92% Facial Recognition False Positives Rate”?
It has been widely reported that the facial recognition system deployed by South Wales Police in Cardiff generated an inaccurate matching performance of a “92% False Positive Rate”. In response to a freedom of information request, it was revealed that:
- 170,000 people arrived in the Welsh capital for the football match between Real Madrid and Juventus.
- 2,470 potential matches were identified.
- 92% (2,297) were deemed to be non-matches.
… and the press and social media sphere have largely been united in their critical reporting and commentary:
- “Welsh police wrongly identify thousands as potential criminals“, reports The Guardian.
- “Police Tested Facial Recognition at a Major Sporting Event. The Results Were Disastrous“, states Fortune.com.
… with multiple outlets citing that the system was performing at a “92% False Positive Rate” or an “8% Accuracy Rate”. The Twittersphere and social media in general erupted with disparaging remarks towards the police and the technology.
But Wait! What Does this Even Mean?
You can read a Facial Recognition Accuracy Worked Example by clicking here.
Simply put … the layman’s takeaway
- Face Recognition does not identify criminals.
– Humans do. The system is a tool to improve the human’s efficiency.
- A False Positive Rate:
– IS a measure of the number of false matches against the total face comparisons made.
– IS NOT a measure of the number of false matches against the total number of matches made.
So, in a hypothetical simple system of:
- 1 camera.
- 10 criminals in the watchlist.
- 100 people crossing the camera every hour.
- Operating at a False Positive Rate of 0.2%.
i.e. on average 1 in every 500 comparisons will generate a false positive.
… there will be:
- 1,000 face comparisons per hour (100 people crossing the camera x 10 people in the watchlist).
- on average 2 false positives every hour (irrespective of how many people, if any, in the watchlist cross the camera).
The bigger the watchlist, the more false positives. The more people crossing the camera, the more false positives.
Importantly, the system:
- IS NOT falsely identifying two people every hour as criminals.
- IS drastically increasing officer efficiency
… by enabling them to quickly assess 2 people each hour, instead of 100 people each hour.
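The arithmetic behind this hypothetical system can be sketched directly. This is a minimal illustration using only the made-up figures from the example above, not data from any real deployment:

```python
# Hypothetical figures from the worked example above (not a real deployment).
watchlist_size = 10            # faces in the watchlist
people_per_hour = 100          # people crossing the single camera each hour
false_positive_rate = 0.002    # 0.2%: on average 1 false match per 500 comparisons

# Every person crossing the camera is compared against every watchlist entry.
comparisons_per_hour = people_per_hour * watchlist_size                   # 1,000
expected_false_positives = comparisons_per_hour * false_positive_rate     # 2.0

print(comparisons_per_hour, expected_false_positives)
```

Note that the expected false positives depend only on traffic volume, watchlist size and the false positive rate, not on whether any watchlist member actually crosses the camera.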
The accuracy of a system is determined by how many watchlist members crossing the camera are missed by the system (i.e. slip through among the other 98 per hour and are potentially never assessed by a human) at the given false positive rate.
As an example, even with an incredibly accurate system, if one person in the watchlist crossed the camera in a 10-hour period and was identified, there would be:
- 1 true positive.
- 20 false positives (on average).
… but the system would still be operating at a false positive rate of 0.2%, NOT 95% (20/21), as the reporting paradigm favoured by the media would have it.
Consider, for example, if NOBODY in the watchlist crossed the camera in that 10 hour period. There would still be on average 20 false positives, but that does not mean the system is 100% inaccurate.
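This 10-hour example can be written down as a short sketch, again using only the hypothetical figures from the text:

```python
# Hypothetical 10-hour window from the example above.
hours = 10
false_positives = 2 * hours        # 2 per hour on average -> 20
true_positives = 1                 # one watchlist member crossed and was caught

# The media paradigm: false matches as a fraction of all matches generated.
media_style_rate = false_positives / (false_positives + true_positives)   # 20/21

# The actual false positive rate never changed: still 0.2% of comparisons.
comparisons = hours * 100 * 10     # 100 people/hour, each against 10 watchlist faces
actual_rate = false_positives / comparisons                               # 0.002

print(f"{media_style_rate:.1%} vs {actual_rate:.1%}")
```

The same system, performing identically, looks 95% “wrong” under one paradigm and 0.2% wrong under the other.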
For more detail, read on …
In a facial recognition system:
- a False Positive (Accept) Rate is defined as the “…expectancy of falsely accepting that two face images of two different people are of the same person.”
- a False Negative (Reject) Rate is defined as the “…expectancy of falsely rejecting that two face images of the same person are in fact of the same person.”
… and these two rates are inversely proportional. If you tune the system to minimise False Positives, you’ll miss more of the people in your watchlist who cross the cameras, and vice versa.
You can read more about it here: Face Recognition in Airports.
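This inverse relationship can be illustrated with a toy threshold sweep. The similarity scores below are invented purely for illustration (real systems compare high-dimensional face templates, not a handful of numbers):

```python
# Invented similarity scores, for illustration only.
genuine_scores  = [0.82, 0.74, 0.91, 0.68, 0.88]   # same-person comparisons
impostor_scores = [0.31, 0.55, 0.42, 0.61, 0.28]   # different-person comparisons

def rates(threshold):
    """False positive and false negative rates at a given match threshold."""
    fpr = sum(s >= threshold for s in impostor_scores) / len(impostor_scores)
    fnr = sum(s < threshold for s in genuine_scores) / len(genuine_scores)
    return fpr, fnr

print(rates(0.50))   # lenient threshold: some false positives, no misses
print(rates(0.70))   # strict threshold: no false positives, but misses appear
```

Raising the threshold trades false positives for false negatives; there is no setting that drives both to zero.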
Was the system operating as poorly as reported?
Simply put, there is not enough information to even begin to answer this question. We would need to know additional elements, such as:
- How many faces were in the watchlist?
- How many people crossed the cameras?
- How many False Negatives were there? (How many people did the system miss that were in the watchlist and crossed the cameras?)
From the information at hand, the system may be operating exceptionally well, or exceptionally poorly.
Actual False Positive Rates
Accurately, a False Positive Rate should be reported as the ratio of the number of False Positives to the total number of face comparisons made. As an example, if there were 100 faces in a watchlist and 100 people crossed the camera, there would be 10,000 comparisons (each person crossing the cameras against each watchlist entry). If there were 1 False Positive, that would equate to a 0.01% False Positive Rate.
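That definition can be written down directly, using the worked example just given:

```python
def true_false_positive_rate(false_positives, people_crossing, watchlist_size):
    """False positives as a fraction of all face comparisons made."""
    comparisons = people_crossing * watchlist_size
    return false_positives / comparisons

# The example from the text: 100 watchlist faces, 100 crossings, 1 false match.
rate = true_false_positive_rate(1, 100, 100)
print(f"{rate:.2%}")   # 0.01%
```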
To put this into context, in an airport, e-Gates are typically tuned to operate at a very low false positive rate of 1 in 10,000 (0.01%), meaning that for every 10,000 people who try to go through an e-Gate with somebody else’s passport, on average 1 will be successful. A non-compliant surveillance system will generally operate at a higher False Positive Rate, as it deals with poorer-quality images and a real-world environment, not a controlled e-Gate.
Looking at the system reported on in South Wales, in the worst-case scenario, if only 2,470 people actually walked past the cameras and there was only 1 person in the watchlist, then the system would indeed be operating at a 92% False Positive Rate (2,297 / (2,470 * 1)). Abysmal, but unlikely.
In reality, tens of thousands of people will have walked past the cameras and there were likely to have been hundreds of faces in the watchlist.
Let’s speculate that 100,000 of the 170,000 people that arrived were detected by a camera and that there was only 1 face in the watchlist. The system would then have been operating at a 2.3% (2,297 / (100,000 * 1)) False Positive Rate.
There clearly will have been significantly more than 1 person in the watchlist and each person may have crossed multiple cameras initiating multiple searches. So the system very likely will have been performing at a much better rate than this.
Without knowing the size of the watchlist, the actual number of people detected by a camera and how many people were not matched that should have been, it is impossible to come to a determination if the system is operating well or not.
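The sensitivity to these unknowns can be sketched with the FOI figure of 2,297 false matches. The 100,000 crossings and the 500-face watchlist below are speculative assumptions for illustration, as in the scenarios above:

```python
def rate(false_positives, crossings, watchlist_size):
    # True false positive rate: false matches over total comparisons made.
    return false_positives / (crossings * watchlist_size)

fp = 2297  # false matches reported in the FOI response

print(f"{rate(fp, 2470, 1):.1%}")       # worst case described in the text
print(f"{rate(fp, 100_000, 1):.1%}")    # speculative 100,000 crossings: ~2.3%
print(f"{rate(fp, 100_000, 500):.4%}")  # with a hypothetical 500-face watchlist
```

The same 2,297 false matches support anything from an abysmal rate to an excellent one, depending entirely on the missing denominators.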
So was it actually a “92% False Positive Rate”?
Clearly not. Let’s assume that a system is operating incredibly well and:
- did not miss anybody that walked past the cameras that was in the watchlist (a 0% False Negative Rate, which is unlikely, but just for argument’s sake).
i.e. the people that were detected were the ONLY people in the watchlist that crossed a camera.
- is operating at a False Positive Rate of 0.01%
If 10,000 people walked past the cameras in one day and there were 100 people in the watchlist, we would expect roughly 100 False Matches. (10,000 * 100 * 0.01%)
Additionally, if 100 people in the watchlist crossed a camera and the system successfully caught them all (again, unlikely), there would be 200 matches generated in total.
Using the same reporting paradigm widely used this week, a False Positive Rate of 50% (100/200) would now be reported.
If, however, 100,000 people walked past the camera over ten days, operating at exactly the same accuracy level, we would now expect roughly 1,000 false positives in total, 100 a day.
If the same 100 people in the watchlist crossed a camera and the system successfully caught them all, there would be 1,100 matches in total.
Using the same reporting paradigm widely used this week, a False Positive Rate of 90.9% (1,000 / 1,100) would now be reported.
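This divergence can be sketched over time, using the hypothetical figures above: roughly 100 false positives per day, and at most 100 genuine matches in total:

```python
daily_false_positives = 100   # from the hypothetical 0.01% system above
max_true_positives = 100      # only ever "n" genuine matches for n watchlist faces

for days in (1, 10):
    fps = daily_false_positives * days
    media_style_rate = fps / (fps + max_true_positives)
    print(days, f"{media_style_rate:.1%}")
# day 1:  100/200   = 50.0%
# day 10: 1000/1100 = 90.9%
```

The true false positive rate is identical on every day; only the media-style ratio degrades, because false positives accumulate while genuine matches are capped.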
One of the flaws in the way the metrics have been represented this week is that, with a constant flow of people crossing the cameras, a system will accumulate false positives at a steady rate determined by its actual False Positive Rate. However, if there are “n” faces in the watchlist, there can only ever be “n” genuine positive matches, and these do not arrive in any predictable fashion. This erroneous interpretation will therefore show a continuously rising “False Positive Rate” and a falling accuracy level over time.
What Was the Actual Performance of the System?
With all the missing data elements, such as traffic volumes, the size of the watchlist and the number of matches missed, it is impossible to come to a determination on the true effectiveness of the system.
Allevate‘s extensive experience in deploying a cloud-based facial recognition system in Brazil has demonstrated that even an accurate system will generate a consistent daily number of False Positives, and you can never be sure of when you will realise a genuine match. Each generated match needs to be reliably adjudicated, and the key is to minimise these so as to minimise the adjudication workload and to maintain confidence in the system. If these processes are well defined and established, a system can contribute significantly to public safety by generating reliable matches to be acted upon.
What we do know is that the system in question generated 450 additional arrests over the past 9 months that likely would not otherwise have occurred, using a fraction of the police resources that would otherwise have been required.
How accurate was the system? Tough to say. But it is clear that the manner in which these results have been represented is perhaps less accurate than the system itself.
This problem cannot be resolved with simple logic alone. Only feature selection and network structure will overcome these problems. At Ayonix, for example, we have dealt with such cases for many years.
I wonder about their database and cameras.
I fear your system, and any system, would be reported on just as badly by Big Brother Watch, as the problem is not with the system, but the manner in which the results are being misrepresented.