Welcome to Math Mutation, the podcast where we discuss fun, interesting, or weird corners of mathematics that you would not have heard in school. Recording from our headquarters in the suburbs of Wichita, Kansas, this is Erik Seligman, your host. And now, on to the math.
You’ve probably read or heard at some point about the “replication crisis”, and the related epidemic of scientific fraud, discovered over the past few decades. Researchers at many institutions and universities have been accused of modifying or making up data to support their desired results, after detailed analysis determined that the reported numbers were inconsistent in subtle ways. Or maybe you haven’t heard about this— the major media have been sadly deficient in paying attention to these stories. Most amusingly, earlier this year, such allegations were made against famous Harvard professor Francesca Gino, who has written endless papers on honesty and ethics.
Usually these allegations come from examining past papers and their associated datasets, and performing various statistical tests to figure out if the numbers have a very low probability of being reasonable. For example, in an earlier podcast we discussed Benford’s Law, a subtle property of leading digits in large datasets, which has been known for many years now. But all these complex tests have overlooked a basic property of data that, in hindsight, seems so simple a grade-school child could have invented it. Finally published in 2016 by Brown and Heathers, this is known as the “granularity-related inconsistency of means” test, or GRIM test for short.
Here’s the basic idea behind the GRIM test. If you are averaging a bunch of numbers, there are only certain values possible in the final digits, based on the value you are dividing into the sum of the numbers. For example, suppose I tell you I asked 10 people to rate Math Mutation on a scale of 1-100, with only whole number values allowed, no decimals. Then I report the average result as “95.337”, indicating the incredible public appreciation of my podcast. Sounds great, doesn’t it?
But if you think about it, something is fishy here. I supposedly got some integer total from those 10 people, and divided it by 10, and got 95.337. Exactly what is the integer you can divide by 10 to get 95.337? Of course there is none— there should be at most one digit past the decimal when you divide by 10! For other numbers, there are wider selections of decimals possible; but in general, if you know you got a bunch of whole numbers and divided by a specific whole number, you can determine the possible averages. That’s the GRIM test— checking if the digits in an average (or similar calculation) are consistent with the claimed data. What’s really cool about this test is that, unlike the many statistical tests that check for low probabilities of given results, the GRIM test is absolute: if a paper fails it, there’s a 100% chance that its reported numbers are inconsistent.
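If you want to play along at home, here is a minimal sketch of how such a check might look in Python. The function name, the rounding convention, and the trick of testing nearby integer totals are my own illustration, not Brown and Heathers’ published procedure; it simply asks whether any whole-number total of n responses could round to the reported mean.

```python
# A hypothetical GRIM-style consistency check, assuming the reported mean
# is supposed to come from averaging n whole-number responses.

def grim_consistent(reported_mean: float, n: int, decimals: int = 2) -> bool:
    """Return True if some integer total of n whole-number responses
    could round to the reported mean at the given precision."""
    target = reported_mean * n
    # Check the integer totals nearest the implied sum.
    for total in (int(target) - 1, int(target), int(target) + 1):
        if total < 0:
            continue
        if round(total / n, decimals) == round(reported_mean, decimals):
            return True
    return False

# The example from the podcast: 10 whole-number ratings averaging "95.337".
print(grim_consistent(95.337, 10, decimals=3))  # False -- no integer total works
# A plausible report, by contrast: a total of 953 divided by 10 gives 95.3.
print(grim_consistent(95.3, 10, decimals=1))    # True
```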
Now you would think this issue is so obvious that nobody would be dumb enough to publish results that fail the GRIM test. But there you would be wrong: in the paper where they first described this test, Brown and Heathers applied it to 71 recent papers from major psychology journals, and 36 showed at least one GRIM failure. The GRIM test also played a major role in exposing problems with the famous “pizza studies” at Cornell’s Food and Brand Lab, which claimed to discover surprising relationships between variables such as the price of pizza, the size of slices, male or female accompaniment, and the amount eaten. Sounds like a silly topic, but this research had real-world effects, leading to lab director Brian Wansink’s appointment to major USDA committees and helping to shape US military “healthy eating” programs. Wansink ended up retracting 15 papers, though he insisted that all the issues were honest mistakes or sloppiness rather than fraud. Tragically, humanity then had to revert to our primitive 20th-century understanding of the nature of pizza consumption.
Brown and Heathers are careful to point out that a GRIM test failure doesn’t necessarily indicate fraud. Perhaps the description of research methods in a particular paper omits some detail that would change the GRIM analysis, like a certain number of participants being disqualified for some legitimate reason, which reduces the actual divisor. In other cases the authors have offered excuses like mistakes by low-paid assistants, simple accidents with the notes, typos, or other similar boo-boos short of intentional fraud. But the whole point of scientific papers is to convincingly describe the results in a way that others could reproduce them— so I don’t think these explanations fully let the authors off the hook. And these counts don’t even include the many cases of uncooperative authors who refuse to send researchers like Brown and Heathers their original data. Thus it seems clear that the large number of GRIM failures, and the papers retracted as a result of this and similar tests, indicate a serious problem with the way research is conducted, published, and rewarded in modern academia.
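To see why the divisor matters so much, we can reuse the hypothetical grim_consistent sketch from earlier with made-up numbers of my own: a reported mean of 95.33 from 10 whole-number ratings fails the check, but the very same mean passes if one participant had been quietly excluded, since 858 divided by 9 is 95.333..., which rounds to 95.33.

```python
# Illustrative only, reusing the grim_consistent sketch defined above.
print(grim_consistent(95.33, 10))  # False -- no integer total over 10 works
print(grim_consistent(95.33, 9))   # True  -- a total of 858 over 9 fits
```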
And this has been your math mutation for today.
References:
- https://en.wikipedia.org/wiki/GRIM_test
- https://jamesheathers.medium.com/the-grim-test-a-method-for-evaluating-published-research-9a4e5f05e870
- https://jamesheathers.medium.com/the-grim-test-further-points-follow-ups-and-future-directions-afd55ff67bb0#.vmgjvdvkf
- https://www.npr.org/2023/06/26/1184289296/harvard-professor-dishonesty-francesca-gino
- https://en.wikipedia.org/wiki/Benford%27s_law
- https://www.vox.com/science-and-health/2018/9/19/17879102/brian-wansink-cornell-food-brand-lab-retractions-jama