Author: Peter Orchard
Editors: Theresa Mau, Bryan Moyers, Alisha John
Almost 100 years ago, the English biologist and statistician Dr. Ronald Fisher was enjoying a cup of tea with his Cambridge University colleagues when another biologist, Dr. Muriel Bristol, made an interesting claim. Bristol asserted that just by tasting her tea, she could infer whether the tea was poured into the cup before the milk, or the milk before the tea.
The experiment that supposedly followed has become one of the best-known anecdotes in modern statistics. Bristol was presented with eight full teacups: in four, tea had been poured before milk, and in the other four milk had been added first.
If I were to tell you that she correctly classified all 8 cups of tea, how convinced of Bristol’s claim would you be? If she correctly guessed 6 of the 8 cups, would you still believe her? Fisher addressed this by framing the question as follows:
If Bristol’s claim is false (that is, she cannot tell the difference), then the results of the experiment are attributable to chance only. Bristol, who knows that half of the cups were poured tea-first, has a 50% chance of guessing each cup correctly. The probability of guessing all 8 cups correctly is 1 in 70; the probability of getting 6 right is 16 in 70; and so on (you can see the math here).
Let’s imagine that Bristol correctly guessed 6 cups. (The results of the actual experiment are in some dispute, but many sources claim that she got all 8 correct). Fisher then asked: what is the probability, assuming that Bristol has no special tea-sensing powers, that she would guess at least 6 cups correctly? By summing the probabilities of all outcomes in which 6 or more guesses are correct, we can calculate the probability as 17/70, or 0.243.
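Fisher's arithmetic is easy to reproduce. The sketch below (Python, using exact fractions) counts the equally likely ways Bristol could label the cups and recovers the 17/70 p-value:

```python
from math import comb
from fractions import Fraction

# Bristol must label 4 of the 8 cups as "tea first". If she has no real
# ability, each of the comb(8, 4) = 70 possible choices of 4 cups is
# equally likely.
total = comb(8, 4)

def p_exactly(k):
    # Probability that she correctly identifies exactly k of the 4
    # tea-first cups. Each correct pick also clears a milk-first cup,
    # so she classifies 2*k of the 8 cups correctly.
    return Fraction(comb(4, k) * comb(4, 4 - k), total)

p_value = p_exactly(3) + p_exactly(4)  # at least 6 of the 8 cups correct

print(p_exactly(4))    # 1/70 (all 8 cups right)
print(p_value)         # 17/70
print(float(p_value))  # about 0.243
```

This is exactly the hypergeometric reasoning behind Fisher's exact test, which grew out of this very anecdote.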
That “0.243” is what is known as the p-value for this experiment. The p-value, now one of the most commonly used tools for weighing evidence, represents the probability of observing results at least as extreme as the results obtained, assuming that the results are due to chance alone. The smaller the p-value, the more unlikely the result would be if only chance were at play, and the more likely we are to conclude that the results aren’t explained by chance alone.
Perhaps the most widespread myth about the p-value is the idea that it represents the probability that the results were produced by chance alone. In fact, p-values are calculated under the assumption that the results were produced by chance. An extremely low p-value means that the result we obtained would be exceptionally rare under those conditions, forcing us to consider other explanations.
What is the magic number?
In the above example, we could use the p-value to make a judgment about Bristol’s claim—either she can discern the order in which the milk and tea were added, or she cannot. But at what point do we decide to believe her claim, or any scientific claim? When do we say “this is so unlikely that we refuse to believe it’s due to chance”? The truth is that there isn’t any magic p-value. A p-value is nothing more than a particular shade of grey.
The history of the p-value is surprisingly complex. Somewhere along the line, p = 0.05 was crowned the “default” p-value threshold, below which one rejects the idea that random chance alone is responsible for an outcome. Over the years, an increasing fixation on arbitrary p-value thresholds has led to public calls for reform and a certain amount of cynicism about the use of statistics in science. The American Statistical Association points out that “‘bright-line’ rules (such as ‘p < 0.05’) for justifying scientific claims or conclusions can lead to erroneous beliefs and poor decision making. A conclusion does not immediately become ‘true’ on one side of the divide and ‘false’ on the other”. This fixation on thresholds incentivizes researchers to consider dirty statistical tricks such as p-hacking.
Lies, damned lies, and statistics
P-hacking is the statistics equivalent of cherry-picking. Data from real experiments are messy, and scientists must make tough decisions about how to analyze them. But changing how a dataset is filtered and analyzed changes the resulting p-value. P-hacking is the practice of examining many different variables and running different statistical analyses until an acceptable p-value is achieved, essentially manufacturing the desired result.
Imagine, for example, two political campaigns analyzing crime data. The incumbent party would like to show that violent crime in the city has decreased while they’ve been in power; the opposition party would like to prove the opposite. Using the same dataset (e.g., police reports), both sides may be able to produce statistical support for their claims. The incumbent party may look only at homicides and attempted homicides, while the opposition party includes arrests resulting from verbal threats in their calculations. By making different filtering decisions they may arrive at different conclusions. For an interactive example of p-hacking, check out this demonstration by the statisticians at FiveThirtyEight.
P-hacking is a big problem. But it also reflects a greater truth: even an unbiased researcher must make decisions in data analysis that may alter the conclusion. That’s why it’s essential to clearly and fully explain how a particular dataset was analyzed.
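The danger of trying many analyses can be made concrete with a small simulation. The numbers here are hypothetical (20 candidate ways to slice the data, the conventional 0.05 threshold); the key fact is that under the null hypothesis, a well-behaved test's p-value is uniformly distributed between 0 and 1:

```python
import random

random.seed(42)

N_ANALYSES = 20   # hypothetical: 20 different ways to filter the same data
ALPHA = 0.05
TRIALS = 100_000

# If an analyst runs 20 analyses and reports only the best one, how often
# does at least one p-value dip below 0.05 purely by chance?
hits = sum(
    1 for _ in range(TRIALS)
    if min(random.random() for _ in range(N_ANALYSES)) < ALPHA
)

print(hits / TRIALS)                   # simulated: roughly 0.64
print(1 - (1 - ALPHA) ** N_ANALYSES)   # exact:     about 0.642
```

Even with no real signal anywhere in the data, a "significant" result turns up about two times in three.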
Effect size and context matter
Even when the data are straightforward, context influences interpretation. Imagine, for example, a drug that is reported to extend a person’s lifespan without any side effects. Experiments consistently show that individuals taking the drug live longer than those who don’t. Would you take the drug?
What if the drug costs $50,000 and only extends life by two weeks?
In science, as in life, effect size (e.g., exactly how much longer you’ll live if you take the drug) and context (for example, costs vs. benefits) are important. A p-value is a reflection of statistical significance, not of effect size or the real-world significance of a finding.
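A p-value says nothing about how large an effect is. The sketch below (with made-up numbers: a drug that adds roughly two weeks to a lifespan whose standard deviation is ten years) runs the same two-sample z-test at three sample sizes:

```python
from math import sqrt, erfc

effect = 0.04   # hypothetical benefit: ~2 weeks of extra life, in years
sd = 10.0       # hypothetical standard deviation of lifespan, in years

for n in (1_000, 100_000, 10_000_000):  # participants per group
    se = sd * sqrt(2 / n)        # standard error of the difference in means
    z = effect / se
    p = erfc(abs(z) / sqrt(2))   # two-sided p-value for a z-test
    print(f"n = {n:>10,}   z = {z:5.2f}   p = {p:.2g}")
```

The effect never changes; only the sample size does. With ten million participants per group the p-value becomes astronomically small, yet the drug still buys only two weeks.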
Assuming (is) the worst
The items discussed above—arbitrary p-value thresholds, p-hacking, and effect size/context—are the most commonly cited p-value pitfalls, but there are additional points to remember.
One is that when you calculate a p-value, you make implicit assumptions. If your friend claimed he could predict the outcome of coin tosses, you might flip a coin 100 times, count the number of times that he correctly predicts the result, and then calculate the p-value, assuming that the coin is fair (that is, the probabilities of heads and tails are equal). If this assumption is wrong, then your p-value may be misleading. Suppose, for example, that your friend had knowingly given you a coin that always came up heads; in that case, it would be entirely unimpressive for him to guess all 100 flips correctly.
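A quick sketch of that coin-toss example (the 100-flip experiment is hypothetical). The p-value comes from the binomial distribution, and its meaning depends entirely on the assumed probability of a correct guess:

```python
from math import comb

def p_at_least(k, n, p):
    # P(k or more correct predictions out of n), assuming each prediction
    # is independently correct with probability p
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# Assuming a fair coin, 100 correct calls in a row look miraculous:
print(p_at_least(100, 100, 0.5))   # 2**-100, about 8e-31

# But if the coin always lands heads and your friend knows it,
# "predict heads every time" is correct with probability 1:
print(p_at_least(100, 100, 1.0))   # 1.0
```

The same observed result yields a vanishingly small p-value under one assumption and a completely unremarkable one under the other.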
Furthermore, concluding that a result is not due to chance alone does not mean that it can be attributed to any particular cause. In our tea-tasting story, even if Bristol classifies all the cups correctly, we can only say that there is evidence that she can distinguish between the cups; we cannot say how she may be doing so. (Perhaps she can taste the difference, or perhaps there is some visual cue.)
Statistics is not a magic wand: it cannot turn incorrect assumptions into correct conclusions. If you start in the wrong place, you’ll likely end up in the middle of nowhere.
The future of the p-value
As long as the context and the underlying assumptions are clear, and the p-value is used in combination with other metrics (such as effect size), we can avoid the problems described above and treat it as an informative measure. However, the frequent misuse of p-values has led to conversations (here and here) within the scientific community about when and how p-values should be employed, and how to guard against common problems. Now that scientific fields are collecting more data, more quickly, it is ever easier to create spurious correlations, making p-value reform more urgent every year.
Last year the journal Basic and Applied Social Psychology went so far as to ban the p-value from their articles. Some academics have proposed pre-analysis plans, in which the methods for processing data and determining statistical significance are registered before the data are collected, to combat p-hacking (of course, this comes with its own set of problems).
There is no perfect, universal solution to these problems; fortunately, science is designed to be self-correcting. Though the arc of experimentation is long, continually re-evaluating our data—and our statistical tools—points science towards correctness.
About the author
Peter is a doctoral student in the Bioinformatics program at the University of Michigan. His research focuses on the role of gene regulation in type 2 diabetes. He is generally interested in the use of DNA sequencing and similar technologies to understand and treat disease. Peter holds a bachelor’s degree from the University of Washington and a master’s from the Ludwig Maximilian University of Munich, and enjoys spending time outside when he’s not at the university.
Milk and tea: By User: MarkBTomlinson – Own work, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=25831283