Base rate neglect and Andrew Ross Sorkin's credit card surveillance system

The New York Times' Andrew Ross Sorkin published an article on Christmas Eve arguing that credit card companies should build models that take spending activity as input and return "probability that this customer is planning a mass shooting" as output. An excerpt from the crux of it:

A New York Times examination of mass shootings since the Virginia Tech attack in 2007 reveals how credit cards have become a crucial part of the planning of these massacres. There have been 13 shootings that killed 10 or more people in the last decade, and in at least eight of them, the killers financed their attacks using credit cards. Some used credit to acquire firearms they could not otherwise have afforded.

Those eight shootings killed 217 people. The investigations undertaken in their aftermath uncovered a rich trove of information about the killers’ spending. There were plenty of red flags, if only someone were able to look for them, law enforcement experts say.

Sorkin is well-known for having used his NYT column in the weeks after the Parkland massacre to successfully lobby Citigroup and Bank of America to fire their business customers who sell standard-capacity magazines and other common touchstones. So people on all sides reacted predictably to his new article. Some people loved the idea.

Civil liberties advocates and gun rights advocates were, as a rule, less sanguine.

The philosophical disagreements are well-known to anyone who has studied them, and I won't rehash them here beyond saying that Jonathan Haidt's moral foundations theory captures them well (particularly the theory's care/harm, authority/subversion, and liberty/oppression axes). That can be a fun discussion to have, but it's not one that'll teach us anything new. Instead, we're going to examine something that almost everybody missed: the straightforward innumeracy of Sorkin's article.

Base rate neglect is a simple and counterintuitive idea: the chance that a positive test result is actually a false positive depends both on the test's accuracy and on the prevalence of the condition that the test is looking for. Wikipedia explains, using a hypothetical terrorist detector machine (lightly edited here for brevity):

In a city of 1 million inhabitants let there be 100 terrorists and 999,900 non-terrorists. Thus, the base rate probability of a randomly selected inhabitant of the city being a terrorist is 0.0001, and the base rate probability of that same inhabitant being a non-terrorist is 0.9999. In an attempt to catch the terrorists, the city installs an alarm system with a surveillance camera and automatic facial recognition software. The software has two failure rates of 1%:
  • The false negative rate: If the camera scans a terrorist, a bell will ring 99% of the time, and it will fail to ring 1% of the time.
  • The false positive rate: If the camera scans a non-terrorist, a bell will not ring 99% of the time, but it will ring 1% of the time.

Suppose now that an inhabitant triggers the alarm. What is the chance that the person is a terrorist? In other words, what is P(T | B), the probability that a terrorist has been detected given the ringing of the bell? Someone making the "base rate fallacy" would infer that there is a 99% chance that the detected person is a terrorist.

The fallacy arises from confusing the natures of two different failure rates. The "number of non-bells per 100 terrorists" and the "number of non-terrorists per 100 bells" are unrelated quantities. One does not necessarily equal the other, and they don't even have to be almost equal. To show this, consider what happens if an identical alarm system were set up in a second city with no terrorists at all. As in the first city, the alarm sounds for 1 out of every 100 non-terrorist inhabitants detected, but unlike in the first city, the alarm never sounds for a terrorist. Therefore, 100% of all occasions of the alarm sounding are for non-terrorists, but a false negative rate cannot even be calculated. The "number of non-terrorists per 100 bells" in that city is 100, yet P(T | B) = 0%. There is zero chance that a terrorist has been detected given the ringing of the bell.

Imagine that the first city's entire population of one million people pass in front of the camera. About 99 of the 100 terrorists will trigger the alarm—and so will about 9,999 of the 999,900 non-terrorists. Therefore, about 10,098 people will trigger the alarm, among which about 99 will be terrorists. So, the probability that a person triggering the alarm actually is a terrorist, is only about 99 in 10,098, which is less than 1%, and very, very far below our initial guess of 99%.
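To make that arithmetic concrete, here's a minimal Python sketch of the Wikipedia example above. Every number comes from the quoted passage; nothing else is assumed.

```python
# Minimal sketch of the terrorist-detector example quoted above.
# All inputs come straight from the Wikipedia passage.

terrorists = 100
non_terrorists = 999_900
false_negative_rate = 0.01  # bell fails to ring for a terrorist
false_positive_rate = 0.01  # bell rings for a non-terrorist

# Expected alarms if the whole city walks past the camera.
true_alarms = terrorists * (1 - false_negative_rate)   # ~99
false_alarms = non_terrorists * false_positive_rate    # ~9,999

# P(terrorist | bell) is simply: true alarms / all alarms.
p_terrorist_given_bell = true_alarms / (true_alarms + false_alarms)

print(f"Expected alarms: {true_alarms + false_alarms:,.0f}")        # ~10,098
print(f"P(terrorist | bell) = {p_terrorist_given_bell:.2%}")         # ~0.98%, not 99%
```

Running it reproduces the quote's punchline: roughly 10,098 alarms, of which fewer than 1% point at an actual terrorist.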

We can apply the same math to Sorkin's mass shooter detector to estimate how often its alarms would be false. We'll make a few assumptions about inputs, in each case choosing the value that is maximally generous to his proposal:

  • False positive rate of the credit card company's machine learning system, with respect to mass shooters: 1%
  • False negative rate: 1%
  • Number of people using a credit card to make any gun-related purchase in a given year: 25 million. Pew estimates that 30% of American adults (just over 75 million Americans) own a gun. We'll say that only one-third of them make any gun-related purchase (any ammo, guns, or accessories whatsoever) on a credit card in a given year. That's very likely to be an underestimate, but we'll stick with a low number because that's more generous to Sorkin's model.
  • Number of credit-card-using mass shooters per year: 4 (Sorkin's article identified 8 such murderers over 11 years, so 4 per year is a much higher estimate that is again maximally generous to Sorkin.)

Running these inputs through the math above, we find:

  • Given the 1% false negative rate we've assumed, the system will flag ~4 true positives per year.
  • Given the population of 25 million purchasers and a false positive rate of 1%, the system will flag 250,000 false positives per year.
  • Therefore for every true positive, there will be 62,500 false positives. In other words: when the system raises an alarm, there is a 99.9984% chance that the system is wrong.
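Here's a minimal Python sketch of that arithmetic, using the assumed inputs from the list above (the error rates, the 25 million purchasers, and the 4 shooters per year are the generous assumptions stated there, not measured values):

```python
# Sketch of the mass-shooter-detector arithmetic, using the assumed
# inputs listed above: 1% error rates, 25M purchasers, 4 shooters/year.

purchasers_per_year = 25_000_000    # assumed credit-card gun-related purchasers
shooters_per_year = 4               # generous estimate based on Sorkin's own data
false_negative_rate = 0.01          # assumed
false_positive_rate = 0.01          # assumed

true_positives = round(shooters_per_year * (1 - false_negative_rate))  # ~4 per year
false_positives = purchasers_per_year * false_positive_rate            # 250,000 per year

ratio = false_positives / true_positives
p_alarm_is_wrong = false_positives / (false_positives + true_positives)

print(f"False positives per true positive: {ratio:,.0f}")      # 62,500
print(f"Chance a given alarm is wrong: {p_alarm_is_wrong:.4%}") # ~99.9984%
```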

Remember also that our input numbers are unrealistically generous. False positive and false negative rates around 1% are achievable in tightly defined circumstances: physics models, narrow subsets of medical imaging, and the like. In predicting human behavior, 1% error rates are unheard-of. Companies like Facebook get paid billions of dollars to build these systems, and they hire armies of elite PhDs with unlimited resources to do it — and they're regularly on the front page of Sorkin's own newspaper for messing it up. In the real world, double-digit error rates abound.

An automatic response might be, "62,500 — or 10x that many, or 100x — false positives will be worth it if the system catches one true positive." But that misses the core problem that we set out to solve: you don't know ahead of time which one of the 62,500 (or 10x that many, or 100x that many) is the real mass murderer. And there is no system that can sort through that haystack without running out of resources (after having consumed all the resources that the system was built to save in the first place). Human reviewers can analyze the ML model's flags, but at the cost of increasing bias and false positives — worsening the very problem they're trying to solve.

Base rate neglect is common knowledge in the ML community. One would expect a famous reporter to have learned about it before publishing a front-page story in the New York Times. It's unclear how Sorkin missed this, as it's utterly fatal to his proposal. This would be like writing an article criticizing Amazon for burning jet fuel instead of teleporting packages — interesting to debate, but surreal to do so without ever mentioning that the technology for teleportation doesn't exist.

Update: three days after this post, the New York Times ran an article about Facebook's program to automatically detect suicidal posts on the site. The article focused on the dangers of false positives.
