What if I could produce an experiment that concluded that
listening to an old song could make you younger? Not feel younger, but be
younger. Impossible, of course, but the story of how this can be achieved is a
great example of how easy it is to produce statistically significant findings
in science. All you need is enough 'wriggle room' in the data and a pre-conceived notion of what the results will be. Like ghosts in The Sixth Sense,
scientists often only see what they want to see.
Scientific findings should
be objective, reliable and reproducible, and divorced from the beliefs and
expectations of those doing the science. Theoretically, science is objective. In practice, it
ain’t. It’s all in the way the experiments are done (the set-up, the variables, the analysis and the reporting). We look for results that conform to what we expect (or hope) to find, and when we see them (amongst the non-supporting results)
we tend to believe them, pick them out and hold them up (report them) as proof
of what we hypothesised.
I am not necessarily talking about fraud; I am talking about
bias, which is worse because you probably don’t even know you are doing it.
The study reported below shows not only how it is possible, but
also how easy it is to produce significant, supportive studies, not by
torturing the data until it tells you what you want, but by gently massaging
the data into the shape that you expected it to form. Manufacturing significance
in this ‘gentle’ way is easy, and those who base their decisions on scientific
findings need to be aware of this.
How ‘significance’ works
In the world of science we test things and draw conclusions
based on the results of those tests. From this, knowledge is gained, which
informs future developments and tests (experiments). That’s how it works and that is what has
got us here today, science-wise.
The things we test are hypotheses, and our decision making hinges on how likely the observed results would have been to occur by chance anyway (without our test treatment). For example, if an experimental finding is highly unlikely to have occurred by chance (say, 1 in 100,000), then we reject the 'null hypothesis' (the assumption that our treatment does nothing) and attribute the ‘significant’ finding to whatever it was we were studying: the test treatment, the new drug, the operation or whatever. How else do we explain those improbable results?
The importance of ‘significance’
If we are going to make decisions about rejecting hypotheses
or not, we need a pre-defined cut-off. This is usually set at a probability of p
= 0.05 (a 1 in 20 chance). If the results are less than 5% (0.05) likely to
have occurred by chance, we say that they weren’t due to chance, and assume
that they were due to our wonder drug (or whatever it is we are testing). This
cut-off is the significance level, and this method of decision making is so
important that in the academic world, when you use the word ‘significant’, it
means that you are referring to statistical significance. And statistical
significance is what we are all after. Finding a p value of 0.1 from your
statistical test usually leads to disappointment, and a p value of 0.50 (say) sends you back to the drawing board, because you have no evidence that your treatment works (results like yours had a 50:50 chance of occurring even if your treatment did absolutely nothing).
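To make the decision rule concrete, here is a minimal sketch in Python (the group sizes, means and the small 'true effect' are invented for illustration, not taken from any real study) of testing a two-group comparison against the 0.05 cut-off:

```python
# A minimal sketch of the p < 0.05 decision rule on made-up data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical outcome scores: the 'treatment' group is given a small true
# advantage here purely so the example has something to detect.
treatment = rng.normal(loc=52, scale=10, size=30)
control = rng.normal(loc=50, scale=10, size=30)

t_stat, p_value = stats.ttest_ind(treatment, control)

ALPHA = 0.05  # the conventional significance cut-off discussed above
if p_value < ALPHA:
    print(f"p = {p_value:.3f}: 'significant' -- reject the null hypothesis")
else:
    print(f"p = {p_value:.3f}: not significant -- cannot reject the null")
```

The p value is the probability of seeing a difference at least this big if the two groups were really drawn from the same population; the 0.05 threshold is just the agreed line in the sand.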
Furthermore, your paper will not be seen as important,
innovative or any kind of discovery, and it is less likely to be published (or even
submitted for publication) if your p value is not <0.05. The failure to reach significance is not really a failure; it is just an objective result, but you can see how some people would consider it one.
Manufacturing significance
Note: achieving a p value of less than 0.05 does not
necessarily mean that your treatment worked. Even if the treatment has no
effect, you will still reach statistical significance 5% of the time (by
definition). But what if you could somehow make it more likely than that to get the magic p value? What if you had a better-than-even chance of getting a p value of less than 0.05, even when there was no effect? Well, you need wait no longer, because now you can, by exploiting the ambiguities in research and making conscious choices about which values to include, how to group the data, which covariates to analyse, and so on. Each analysis you try has about a 5% chance of coming up significant by chance alone, so if you play around with any data for long enough, a p value of <0.05 will eventually pop up. Once you find it, you publish that analysis and discard the others. Presto, you have just proved the existence of something that was never there.
P-hacking, also known as data-dredging, fishing or
significance-chasing, means exploiting ambiguities inherent in any research in
order to arrive at different results (with different p values), and then
choosing to report only the significant result and the method you used to
achieve that particular result; the rest is swept under the carpet (or
subconsciously considered erroneous), and what you present to the world (what
you publish) is a neat single analysis of your neat data set, and a significant
result with a p value of less than 0.05.
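A rough simulation makes the point. In this sketch (pure noise, with no real effect anywhere; the particular analysis choices, such as the 'early look', the outlier cut-off and the subgroup, are invented for illustration) we try several analyses on each data set and keep whichever one 'works':

```python
# A rough sketch of p-hacking on pure noise: run several arbitrary analyses
# on the same null data and report only the best p value.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
hits = 0
n_experiments = 2000

for _ in range(n_experiments):
    a = rng.normal(size=40)              # 'treatment' group: no real effect
    b = rng.normal(size=40)              # 'control' group: no real effect
    sex_a = rng.integers(0, 2, size=40)  # an irrelevant grouping variable
    sex_b = rng.integers(0, 2, size=40)

    p_values = [
        stats.ttest_ind(a, b).pvalue,                                # the planned test
        stats.ttest_ind(a[:30], b[:30]).pvalue,                      # an 'early look' at the data
        stats.ttest_ind(a[np.abs(a) < 2], b[np.abs(b) < 2]).pvalue,  # drop 'outliers' and retest
        stats.ttest_ind(a[sex_a == 0], b[sex_b == 0]).pvalue,        # a convenient subgroup
    ]
    if min(p_values) < 0.05:             # report whichever analysis 'worked'
        hits += 1

print(f"False-positive rate when cherry-picking analyses: {hits / n_experiments:.1%}")
# Noticeably above the nominal 5%, even though nothing real is going on.
```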
The study that shows you how
The study I am using for this example is here,
but you don’t have to read it; I will break it down for you. The researchers set
up two experiments that they actually performed on students at the University
of Pennsylvania.
First, the findings of a two-part experiment are reported as they would appear in a journal. Then the authors explain what really went on behind the scenes, and give the same report again, this time longer, because it includes the information needed to show how they REALLY got the results. The experiments they report are as follows:
Part 1
This study aimed to see if listening to a children’s song
made you feel older. Thirty students were
randomly assigned to listen either to “Kalimba” (the bland music that plays
when Windows 7 opens) or “Hot Potato” by (the children’s band) The Wiggles.
Participants were asked to fill out an ostensibly unrelated survey asking how old they felt. Father’s age was used to control for their baseline age, and
differences between the two groups were tested. They found that the group that listened to Hot
Potato felt significantly (p <
0.05) older than the group that heard Kalimba.
No surprises? Makes sense? OK, but that was just the set-up for Part 2.
Now bear with me for a moment. Note that they used father’s
age to control for age, instead of real age. This is important for Part 2, and is a reasonable way of working out how old (on average) a group of people is.
For example, if the average father’s age of people in group 1 was 40, and that
of group 2 was 50, group 2 would (on
average) be about 10 years older than the people in group 1. Get it?
Part 2
Part 2 investigated whether listening to a song about older
age makes people actually younger.
(Note: this is clearly impossible, but that’s what makes it brilliant.) The results for Part 2 (which they call “Study 2”) are given here:
'Using the same methods
as Study 1, we asked 20 [students] to listen to either “When I’m Sixty-Four” by
The Beatles, or “Kalimba”. Then, in an ostensibly unrelated task, they
indicated their birth date and their father’s age. We used father’s age to control
for baseline age across participants.
According to their
birth dates, people were nearly a year-and-a-half younger after listening to “When
I’m Sixty-Four” (adjusted mean = 20.1 years) rather than to “Kalimba” (adjusted
mean = 21.5 years), p = .040.'
How they did it
Everything they did could be considered reasonable in a scientific study of this type. Their point is that there are too many “researcher degrees of freedom” (I wish I’d thought of that term), meaning too many ways in which the researchers can influence the outcome. Having so many degrees of freedom generates many possible analyses, and the probability of finding the outcome you want (or expect) somewhere amongst them goes up. The trick is that the p value you report describes only the one analysis you ended up using, as if it were the only analysis you ever ran (and there were no other possible ways of doing it), so the reported result looks very unlikely to have occurred by chance. Some of those degrees of freedom (flexibilities) are listed here:
1. “Flexible” sample size
The authors recruited 20 participants and then ran the tests. They planned to add another 10 participants if the study didn’t show what they expected, and to test it again. That doesn’t sound like much, but it is really two separate analyses; two bites at the cherry, roughly doubling the chance of a false-positive result (the simulation sketch after this list shows the effect).
2. Adding variables
The authors actually had the participants listen to a third song in Part 2 (not reported in the original write-up), Hot Potato. Adding another condition allowed for more ways to analyse the data, this time roughly tripling the chance of getting a ‘false positive’ result.
They also recorded mother’s age as a possible controlling variable for age (as well as father’s age): another chance to get a false-positive result if ‘father’s age’ didn’t work. And they recorded and analysed many other variables that were not mentioned in the initial report.
The fully reported results of Part 2
The original report is as quoted above. Below, the same report is given again with the omitted information added back in; it is this added information that allows detection of the possible bias in the methods.
Using the same method as in Study 1, we asked 34 [not 20, as originally reported] University of Pennsylvania undergraduates to listen to either “When I’m Sixty-Four” by The Beatles or “Kalimba” or “Hot Potato” by the Wiggles. We
conducted our analyses after every session of approximately 10 participants; we
did not decide in advance when to terminate data collection. Then, in an ostensibly unrelated task, they
indicated only their birth date
(mm/dd/yyyy) and how old they felt, how much they would enjoy eating at a
diner, the square root of 100, their agreement with “computers are complicated
machines,” their father’s age, their
mother’s age, whether they would take advantage of an early-bird special, their
political orientation, which of four Canadian quarterbacks they believed won an
award, how often they refer to the past as “the good old days,” and their
gender. We used father’s age to control
for variation in baseline age across participants.
According to their birth dates, people were nearly a year-and-a-half
younger after listening to “When I’m Sixty-Four” (adjusted mean = 20.1 years)
rather than to “Kalimba” (adjusted mean = 21.5 years), p = .040. Without controlling for father’s age, the
age difference was smaller and did not reach significance (means = 20.3 and
21.2, respectively), p = .33.
The simulation
They ran a simulation of studies to determine the effect of adding a variable, adding a covariate, running 3 different analyses instead of 2, and adding 10 more participants to their sample size of 20. They found that if you did
all 4 of these seemingly harmless things, the chance of getting a p value of
less than 0.05 when there is no real effect is about 61%.
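I have not reproduced the authors' exact simulation, but a rough sketch of the same idea is below (pure noise throughout; the covariate adjustment is a simple residualisation standing in for their ANCOVA, the condition labels are only decorative, and the group sizes are assumptions). It stacks the pick of three pairwise comparisons, an optional covariate and a second look at the data:

```python
# A rough reconstruction (not the authors' exact code) of stacking several
# researcher degrees of freedom on null data: three conditions, an optional
# covariate adjustment, and one extra look after adding participants.
import itertools
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

def any_significant(groups, covariates):
    """True if ANY pairwise comparison, raw or covariate-adjusted, has p < 0.05.
    The 'adjustment' regresses the covariate out of the outcome -- a crude
    stand-in for the ANCOVA used in the paper."""
    p_values = []
    for x, y in itertools.combinations(groups, 2):
        a, b = groups[x], groups[y]
        p_values.append(stats.ttest_ind(a, b).pvalue)
        pooled = np.concatenate([a, b])
        cov = np.concatenate([covariates[x], covariates[y]])
        fit = stats.linregress(cov, pooled)
        resid = pooled - (fit.intercept + fit.slope * cov)
        p_values.append(stats.ttest_ind(resid[:len(a)], resid[len(a):]).pvalue)
    return min(p_values) < 0.05

n_sims, hits = 2000, 0
for _ in range(n_sims):
    conditions = ("sixty_four", "kalimba", "hot_potato")   # illustrative labels
    groups = {c: rng.normal(size=10) for c in conditions}  # outcome: pure noise
    covs = {c: rng.normal(size=10) for c in conditions}    # covariate stand-in
    if any_significant(groups, covs):
        hits += 1
        continue
    # Not significant yet? Add a few more participants per group and retest.
    for c in conditions:
        groups[c] = np.concatenate([groups[c], rng.normal(size=5)])
        covs[c] = np.concatenate([covs[c], rng.normal(size=5)])
    if any_significant(groups, covs):
        hits += 1

print(f"False-positive rate with the freedoms stacked: {hits / n_sims:.1%}")
# This simpler sketch will not reproduce the paper's ~61% figure exactly,
# but it lands well above the nominal 5%.
```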
The bottom line
Without making up data (fraud), just by playing around with the analysis you do (how many participants you include, what variables you include, which ones you adjust for, etc.), it is very easy to generate a statistically significant result (something seemingly unlikely to have occurred by chance, and therefore supporting your hypothesis), even when your treatment had no effect. If your treatment had a small real effect, it would be even easier to make it look more effective than it is. The answer is honest and complete reporting of every variable used, assumption made, data point excluded and analysis performed, and for scientists to recognise their own biases.
A famous example (real life this time)
Richard Feynman used a famous real-life example to show how scientists can produce biased results by trying to make the results fit their expectations rather than the data, and he used it to warn scientists against fooling themselves. Here it is (possibly not his voice) on YouTube. It is from a larger speech (also worth reading), his 1974 Caltech commencement address, which can be found here.
Comments
Thanks for this great post! The physicist Richard Feynman put it perfectly: "The first principle [of good science] is that you must not fool yourself—and you are the easiest person to fool." Now I hope you might do the same analysis on a real medical study.
For those who are interested in an example of bias hiding in plain sight, here's just one of many: Healy, D. Journal of Psychiatry & Neuroscience, vol. 28, p. 331.
Great discussion of a great demonstration!
I'm pretty sure that's not Feynman's voice on the linked video, however. The text is from his famous Caltech graduation speech in (I think) 1974, in which he discusses "Cargo Cult Science" (you can hear the end of that part when he talks about why "the planes don't land"). But the voice doesn't sound anything like him. It sounds like a much younger man, and Feynman had a heavy Brooklyn accent his whole life. I think the creator was just reading Feynman's words onto the video.
Thanks. I was suspicious when I saw there were only 42 views, and when it didn't sound like him. I have amended the blog post to reflect this but kept the link, because it is explained well, even if it is not from the man himself. Full text of the Caltech address has also been added (http://neurotheory.columbia.edu/~ken/cargo_cult.html)
Great article. I've heard 'rumblings' that the idea of a threshold of statistical significance might be scrapped, and replaced with a 'however high you can manage' system; but I'm not sure whether journals would go along with that. They'd have to decide on their own standards of statistical significance... 0.1 / 0.05 / 0.02... for a paper to be considered for publication. Then again, I don't work in Medicine, so what do I know!
FYI: The info section links to the original video, as a "source video", which was read aloud by YouTuber C0nc0rdance, with whom I was already familiar. I think he has a background in Biology, and he is a fellow Skeptic / Rationalist.
Interesting that you say that the journals consider articles for publication based on the significance level. This is a form of publication bias and there is conflicting evidence about this. Many people see the over-representation of low p values (my p value paper is coming out soon) as evidence for publication bias. I think that publication bias, at least from the journals, is overestimated, and that most of the bias comes from the researchers, who either don't submit the non-significant papers or massage them into becoming significant.
A failure to achieve significance may in itself be significant. I am attracted to black swan events, things that occur that lie far outside of model predictions. A treatment may appear helpful in half the population tested, but in 1 in 10,000 cases it causes heart failure. The solution is NOT to hide the outlier. Unfortunately, with money on the line that may be precisely what the experimenters will do. Just who do you think you're fooling?
Thanks, and yes: 1 in 10,000 is important if it is something serious like death, and if that number is reliable. NASA worked on a probable failure rate of 1 in 100,000 for their shuttle launches, but when the shuttle exploded on take-off, they realised that the managers were only seeing what they wanted to see, and that the real risk of failure was more like 1 in 100. Feynman came to the rescue there as well - they should make a movie about that one. See here: http://en.wikipedia.org/wiki/Rogers_Commission_Report#Role_of_Richard_Feynman
Thanks, great article. The Richard Feynman story is also really interesting. It seems, though, that there has been a movie made about his role in the Challenger disaster: http://www.imdb.com/title/tt2421662/. I hope it's available somewhere on the internet.
Always thought that would be a great movie. Now we have William Hurt as Richard Feynman - can't wait.