Sunday, 16 March 2014

Manufacturing significance

What if I could produce an experiment that concluded that listening to an old song could make you younger? Not feel younger, but be younger. Impossible, of course, but the story of how this can be achieved is a great example of how easy it is to produce statistically significant findings in science. All you need is enough 'wriggle room' in the data and a pre-conceived notion of what the results will be. Like ghosts in The Sixth Sense, scientists often only see what they want to see.

Scientific findings should be objective, reliable and reproducible, and divorced from the beliefs and expectations of those doing the science. Theoretically, science is objective. In practice, it ain’t. It’s all in the way the experiments are done (the set up, the variables, the analysis, and the reporting). We look for results that conform with what we expect (or hope) to find, and when we see them (amongst the non-supporting results) we tend to believe them, pick them out and hold them up (report them) as proof of what we hypothesised.

I am not necessarily talking about fraud; I am talking about bias, which is worse because you probably don’t even know you are doing it.

The study reported below shows not only how it is possible, but also how easy it is to produce significant, supportive studies, not by torturing the data until it tells you what you want, but by gently massaging the data into the shape that you expected it to form. Manufacturing significance in this ‘gentle’ way is easy, and those who base their decisions on scientific findings need to be aware of this.

How ‘significance’ works
In the world of science we test things, and make conclusions based on the results of those tests. From this, knowledge is gained, which informs future developments and tests (experiments). That’s how it works and that is what has got us here today, science-wise.

The things we test are hypotheses, and our decision to reject them is based on how likely the results were to occur by chance anyway (without our test treatment). For example, if an experimental finding is highly unlikely to have occurred by chance (say, 1 in 100,000), then we would reject the 'null hypothesis' (the assumption that the treatment does nothing), and attribute the ‘significant’ finding to whatever it was we were studying: the test treatment, the new drug, the operation or whatever. How else do we explain those improbable results?

The importance of ‘significance’
If we are going to make decisions about rejecting hypotheses or not, we need a pre-defined cut-off. This is usually set at a probability of p = 0.05 (a 1 in 20 chance). If the results are less than 5% (0.05) likely to have occurred by chance, we say that they weren’t due to chance, and assume that they were due to our wonder drug (or whatever it is we are testing). This cut-off is the significance level, and this method of decision making is so important that in the academic world, when you use the word ‘significant’, it means that you are referring to statistical significance. And statistical significance is what we are all after. Finding a p value of 0.1 from your statistical test usually leads to disappointment, and a p value of 0.50 (say) means that you need to get back to the drawing board because your treatment doesn’t work (as your results had a 50:50 chance of occurring if your treatment did absolutely nothing).
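To make the 1 in 20 idea concrete, here is a minimal sketch (mine, not from the study) that runs thousands of pretend experiments in which the treatment does absolutely nothing, and counts how often the result still comes out 'significant'. The group size of 20, the normally distributed scores and the simple z-test with a known standard deviation are all assumptions chosen to keep the example short.

```python
import random
from statistics import NormalDist, fmean

def two_sided_p(group_a, group_b, sigma=1.0):
    """Two-sided p value from a z-test on the difference in means.

    Assumes the standard deviation (sigma) of the scores is known.
    """
    standard_error = sigma * (1 / len(group_a) + 1 / len(group_b)) ** 0.5
    z = (fmean(group_a) - fmean(group_b)) / standard_error
    return 2 * (1 - NormalDist().cdf(abs(z)))

def false_positive_rate(n_experiments=20000, n_per_group=20, alpha=0.05, seed=1):
    """Fraction of no-effect experiments that still reach p < alpha."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_experiments):
        # Both groups come from the same distribution: the treatment does nothing.
        treated = [rng.gauss(0, 1) for _ in range(n_per_group)]
        control = [rng.gauss(0, 1) for _ in range(n_per_group)]
        if two_sided_p(treated, control) < alpha:
            hits += 1
    return hits / n_experiments

rate = false_positive_rate()
print(f"'Significant' results with no real effect: {rate:.1%}")
```

With one pre-planned analysis per experiment, the proportion should land close to the 5% cut-off, which is exactly what the significance level promises.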

Furthermore, your paper will not be seen as important, innovative or any kind of discovery, and it is less likely to be published (or even submitted for publication) if your p value is not <0.05. The failure to reach significance is not really a failure, it's just an objective result, but you can see how some people would consider it such.

Manufacturing significance
Note: achieving a p value of less than 0.05 does not necessarily mean that your treatment worked. Even if the treatment has no effect, you will still reach statistical significance 5% of the time (by definition). But what if you could somehow make it more likely than that to get the magic p value? What if you had a better than even chance of getting a p value of less than 0.05, even when there was no effect? Well, you need wait no longer because now you can, by exploiting the ambiguities in research and making choices (conscious or not) about included values, data grouping, covariate analyses and so on. Each way of analysing the data has about a 5% chance of throwing up a p value of <0.05, so if you play around with any data long enough, one will eventually pop up. Once you find it, you publish that analysis and discard the others. Presto, you just proved the existence of something that was never there.
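As a rough guide to how quickly the chances add up, suppose (an idealisation of mine, not something stated in the study) that you try k analyses that are completely independent of each other, each with a 5% chance of a false positive. The chance that at least one of them comes out significant is then:

```latex
P(\text{at least one } p < 0.05) = 1 - (1 - 0.05)^{k}
% k = 1:   1 - 0.95^{1}  = 0.05
% k = 5:   1 - 0.95^{5}  \approx 0.23
% k = 14:  1 - 0.95^{14} \approx 0.51
```

In practice, different analyses of the same data set are correlated, so the inflation for a given k is smaller than this formula suggests, but the direction is the same: every extra analysis is another ticket in the lottery.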

P-hacking, also known as data-dredging, fishing or significance-chasing, means exploiting ambiguities inherent in any research in order to arrive at different results (with different p values), and then choosing to report only the significant result and the method you used to achieve that particular result. The rest is swept under the carpet (or subconsciously considered erroneous), and what you present to the world (what you publish) is a neat single analysis of your neat data set, and a significant result with a p value of less than 0.05.

The study that shows you how
The study I am using for this example is here, but you don’t have to read it; I will break it down for you. The researchers set up two experiments that they actually performed on students at the University of Pennsylvania.

First, the findings of a 2-part experiment are reported, as they would appear in a journal. Then the authors explain what really went on behind the scenes, and then give the exact same report, but longer, because they add the information needed to show how they REALLY got the results. The experiments they report are as follows:

Part 1
This study aimed to see if listening to a children’s song made you feel older. 30 students were randomly assigned to listen either to “Kalimba” (the bland instrumental track that comes bundled with Windows 7) or “Hot Potato” by (the children’s band) The Wiggles. Participants were then asked to fill out an ostensibly unrelated survey that asked how old they felt. Father’s age was used to control for their baseline age, and differences between the two groups were tested. They found that the group that listened to Hot Potato felt significantly (p < 0.05) older than the group that heard Kalimba.

No surprises? Makes sense? OK, but that was just the set up for Part 2.

Now bear with me for a moment. Note that they used father’s age to control for age, instead of real age. This is important for study 2, and is a reasonable way of working out how old (on average) a group of people are. For example, if the average father’s age of people in group 1 was 40, and that of group 2 was 50, group 2 would (on average) be about 10 years older than the people in group 1. Get it?

Part 2
Part 2 investigated whether listening to a song about older age makes people actually younger. (Note: this is clearly impossible, but that’s what makes it brilliant). The results for Part 2 (which they call "Study 2") are given here:

'Using the same methods as Study 1, we asked 20 [students] to listen to either “When I’m Sixty-Four” by The Beatles, or “Kalimba”. Then, in an ostensibly unrelated task, they indicated their birth date and their father’s age. We used father’s age to control for baseline age across participants.
According to their birth dates, people were nearly a year-and-a-half younger after listening to “When I’m Sixty-Four” (adjusted mean = 20.1 years) rather than to “Kalimba” (adjusted mean = 21.5 years), p = .040.'

How they did it
Everything they did could be considered reasonable in a scientific study of this type. Their point is that there are too many “researcher degrees of freedom” (I wish I’d thought of that term). That means there are too many ways in which they can influence the outcome. Having so many “degrees of freedom” generates many possible outcomes, and the probability of finding the outcome you want (or expect) amongst so many different analyses goes up. The magic is that the probability of getting the one particular outcome that you use, alone (if that was all you did, and there were no other possible ways of doing it) is likely to be very low, and that is the “p value” (probability) that you report in the study. Some of those degrees of freedom (flexibility) are listed here:

1. “Flexible” sample size
The authors recruited 20 participants, and then ran the tests. They planned to add another 10 participants if the study didn’t show what they expected, and test it again. Doesn’t sound like much, but this is really two separate analyses; two bites at the cherry. Because the second test re-uses the first 20 participants, this does not quite double the chance of a wrong (false positive) result, but it does push it noticeably above 5%.

2. Adding variables
The authors actually had the participants listen to a third song in Part 2 (not reported), Hot Potato. Adding another condition allowed for more ways to analyse the data: with three songs there are three possible pairwise comparisons instead of one, each a fresh chance of getting a ‘false positive’ result.
They also recorded mother’s age as a possible controlling variable for age (as well as father’s age). Another chance to get a false positive result if 'father's age' didn't work. They also recorded and analysed many other variables that were not mentioned in the initial report.
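The 'two bites at the cherry' effect of the flexible sample size in point 1 can be checked with a small simulation. This is a sketch under my own assumptions, not the authors' code: 20 participants per group at the first look, 10 more per group if the first test fails, normally distributed scores, no real effect, and a z-test with a known standard deviation.

```python
import random
from statistics import NormalDist, fmean

def two_sided_p(group_a, group_b, sigma=1.0):
    """Two-sided p value from a z-test on the difference in means (sigma assumed known)."""
    standard_error = sigma * (1 / len(group_a) + 1 / len(group_b)) ** 0.5
    z = (fmean(group_a) - fmean(group_b)) / standard_error
    return 2 * (1 - NormalDist().cdf(abs(z)))

def compare_stopping_rules(n_experiments=20000, first_look=20, extra=10,
                           alpha=0.05, seed=2):
    """Return (fixed-sample rate, test-then-add-more rate) when there is no effect."""
    rng = random.Random(seed)
    fixed_hits = 0
    flexible_hits = 0
    for _ in range(n_experiments):
        a = [rng.gauss(0, 1) for _ in range(first_look)]
        b = [rng.gauss(0, 1) for _ in range(first_look)]
        if two_sided_p(a, b) < alpha:
            # Significant at the first look: both rules stop and report it.
            fixed_hits += 1
            flexible_hits += 1
            continue
        # Not significant: the flexible researcher recruits more and tests again.
        a += [rng.gauss(0, 1) for _ in range(extra)]
        b += [rng.gauss(0, 1) for _ in range(extra)]
        if two_sided_p(a, b) < alpha:
            flexible_hits += 1
    return fixed_hits / n_experiments, flexible_hits / n_experiments

fixed_rate, flexible_rate = compare_stopping_rules()
print(f"Fixed sample size:    {fixed_rate:.1%} false positives")
print(f"Test, then add more:  {flexible_rate:.1%} false positives")
```

The fixed-sample rate should sit near 5%, while the flexible rule comes out higher, even though no single step looks unreasonable.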

The fully reported results of Part 2
The original report (as above) is repeated here, expanded with the added information that allows detection of the possible bias in the methods.

Using the same method as in Study 1, we asked 34 University of Pennsylvania undergraduates to listen to either “When I’m Sixty-Four” by The Beatles or “Kalimba” or “Hot Potato” by the Wiggles. We conducted our analyses after every session of approximately 10 participants; we did not decide in advance when to terminate data collection. Then, in an ostensibly unrelated task, they indicated their birth date (mm/dd/yyyy) and how old they felt, how much they would enjoy eating at a diner, the square root of 100, their agreement with “computers are complicated machines,” their father’s age, their mother’s age, whether they would take advantage of an early-bird special, their political orientation, which of four Canadian quarterbacks they believed won an award, how often they refer to the past as “the good old days,” and their gender. We used father’s age to control for variation in baseline age across participants.
According to their birth dates, people were nearly a year-and-a-half younger after listening to “When I’m Sixty-Four” (adjusted mean = 20.1 years) rather than to “Kalimba” (adjusted mean = 21.5 years), p = .040. Without controlling for father’s age, the age difference was smaller and did not reach significance (means = 20.3 and 21.2, respectively), p = .33.

The simulation
They ran a simulation of studies to determine the effect of adding a variable, adding a covariate, running 3 different analyses instead of 2, and adding 10 more participants to their sample size of 20. They found that if you did all 4 of these seemingly harmless things, the chance of getting a p value of less than 0.05 when there is no real effect is about 61%.
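To get a feel for how these choices stack up, here is a simplified sketch of that kind of simulation. It is not the authors' code and will not reproduce their 61% figure: I leave out the covariate, use a z-test with a known standard deviation, and assume two outcome measures correlated at 0.5 (plus their average), three conditions compared pairwise, and 20 participants per condition with 10 more added if nothing is significant at first.

```python
import random
from itertools import combinations
from statistics import NormalDist, fmean

ALPHA = 0.05
R = 0.5  # assumed correlation between the two outcome measures
STANDARD_NORMAL = NormalDist()

def draw_participant(rng):
    """Two correlated outcome scores for one participant; no treatment effect."""
    shared = rng.gauss(0, 1)
    first = R ** 0.5 * shared + (1 - R) ** 0.5 * rng.gauss(0, 1)
    second = R ** 0.5 * shared + (1 - R) ** 0.5 * rng.gauss(0, 1)
    return first, second

# Each candidate outcome: (how to score a participant, known SD of that score).
MEASURES = [
    (lambda p: p[0], 1.0),
    (lambda p: p[1], 1.0),
    (lambda p: (p[0] + p[1]) / 2, ((1 + R) / 2) ** 0.5),
]

def p_value(group_a, group_b, score, sigma):
    """Two-sided z-test p value for one outcome measure."""
    a = [score(p) for p in group_a]
    b = [score(p) for p in group_b]
    standard_error = sigma * (1 / len(a) + 1 / len(b)) ** 0.5
    z = (fmean(a) - fmean(b)) / standard_error
    return 2 * (1 - STANDARD_NORMAL.cdf(abs(z)))

def any_significant(conditions):
    """True if any pair of conditions differs on any measure at p < ALPHA."""
    return any(
        p_value(a, b, score, sigma) < ALPHA
        for a, b in combinations(conditions, 2)
        for score, sigma in MEASURES
    )

def strict_study(rng, n=20):
    """One pre-planned comparison: two conditions, one measure, fixed n."""
    a = [draw_participant(rng) for _ in range(n)]
    b = [draw_participant(rng) for _ in range(n)]
    score, sigma = MEASURES[0]
    return p_value(a, b, score, sigma) < ALPHA

def flexible_study(rng, n=20, extra=10):
    """Three conditions, three measures, and a second look after adding people."""
    conditions = [[draw_participant(rng) for _ in range(n)] for _ in range(3)]
    if any_significant(conditions):
        return True
    for condition in conditions:
        condition.extend(draw_participant(rng) for _ in range(extra))
    return any_significant(conditions)

def rate(study, n_studies=4000, seed=3):
    """Fraction of simulated no-effect studies that report a significant result."""
    rng = random.Random(seed)
    return sum(study(rng) for _ in range(n_studies)) / n_studies

strict_rate = rate(strict_study)
flexible_rate = rate(flexible_study)
print(f"One planned analysis:       {strict_rate:.1%}")
print(f"Pick whichever one 'works': {flexible_rate:.1%}")
```

The strict researcher stays near 5%. The flexible one, who reports whichever comparison happens to cross the line, ends up with a false positive rate several times higher, without making up a single data point.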

The bottom line
Without making up data (fraud), just by playing around with the analysis you do (how many participants you include, what variables you include, which ones you adjust for, etc.), it is very easy to generate a statistically significant result (something seemingly unlikely to have occurred by chance, and therefore supporting your hypothesis), even when your treatment had no effect. It would be even easier, if your treatment had a small effect, to make it look more effective. The answer to this is honest and complete reporting of every variable used, assumption made, data point excluded and analysis performed, and for scientists to recognise their own biases.


A famous example (real life this time)
This has been used by Richard Feynman to show how scientists can produce biased results by trying to make them fit their expectations, rather than the data. He uses it to warn scientists against fooling themselves. Here it is (possibly not his voice) on YouTube. It is from a larger speech (also worth reading); a 1974 Caltech commencement address, which can be found here.

9 comments:

  1. Thanks for this great post! The physicist Richard Feynman put it perfectly: "The first principle [of good science] is that you must not fool yourself—and you are the easiest person to fool." Now I hope you might do the same analysis on a real medical study.

    For those who are interested in an example of bias hiding in plain sight, here's just one of many: Healy, D. Journal of Psychiatry & Neuroscience, vol 28, p 331

  2. Great discussion of a great demonstration!

    I'm pretty sure that's not Feynman's voice on the linked video, however. The text is from his famous Caltech graduation speech in (I think) 1974, in which he discusses "Cargo Cult Science" (you can hear the end of that part when he talks about why "the planes don't land"). But the voice doesn't sound anything like him. It sounds like a much younger man, and Feynman had a heavy Brooklyn accent his whole life. I think the creator was just reading Feynman's words onto the video.

    Replies
    1. Thanks. I was suspicious when I saw there were only 42 views, and when it didn't sound like him. I have amended the blog post to reflect this but kept the link, because it is explained well, even if it is not from the man himself. Full text of the Caltech address has also been added (http://neurotheory.columbia.edu/~ken/cargo_cult.html)

    2. Great article. I've heard 'rumblings' that the idea of a threshold of statistical significance might be scrapped, and replaced with a 'however high you can manage' system; but I'm not sure whether journals would go along with that. They'd have to decide on their own standards of statistical significance... 0.1 / 0.05 / 0.02... for a paper to be considered for publication. Then again, I don't work in Medicine, so what do I know!

      FYI: The info section links to the original video, as a "source video", which was read aloud by YouTuber C0nc0rdance, with whom I was already familiar. I think he has a background in Biology, and he is a fellow Skeptic / Rationalist.

    3. Interesting that you say that the journals consider articles for publication based on the significance level. This is a form of publication bias and there is conflicting evidence about this. Many people see the over-representation of low p values (my p value paper is coming out soon) as evidence for publication bias. I think that publication bias, at least from the journals, is overestimated, and that most of the bias comes from the researchers who either don't submit the non-significant papers, or massage them into becoming significant.

  3. A failure to achieve significance may in itself be significant. I am attracted to black swan events, things that occur that lie far outside of model predictions. A treatment may appear helpful in half the population tested but, 1 in 10,000 cases it causes heart failure. The solution is NOT to hide the outlier. Unfortunately, with money on the line that may be precisely what the experimenters will do. Just who do you think you're fooling?

    Replies
    1. Thanks, and yes: 1 in 10,000 is important if it is something serious like death, and if that number is reliable. NASA worked on a probable failure rate of 1 in 100,000 for their shuttle launches, but when the shuttle exploded on take off, they realised that the managers were only seeing what they wanted to see, and that the real risk of failure was more like 1 in 100. Feynman came to the rescue there as well - they should make a movie about that one. See here: http://en.wikipedia.org/wiki/Rogers_Commission_Report#Role_of_Richard_Feynman

  4. Thanks, great article. The Richard Feynman story is also really interesting. It seems though that there has been a movie made about his role in the Challenger disaster: http://www.imdb.com/title/tt2421662/. I hope it's available somewhere on the internet.

    Replies
    1. Always thought that would be a great movie. Now we have William Hurt as Richard Feynman - can't wait.
