This article is about two common problems with “statistical significance” in medical research. Both problems are particularly rampant in the study of massage therapy, chiropractic and alternative medicine in general, and are wonderful examples of why science is hard and “why most published research findings are false”:
Stats are hard and scary, of course — everyone knows that. But there will be comic strips, funny videos, and diagrams. I’ll try to make it worth your while.
Many scientific papers conclude that their findings are “significant” without actually reporting their numbers. Almost any reader but a statistician will be fooled, because “significant” sounds so much like “a big deal.” But statistical significance is a technical thing, meaning only that the result of the study probably isn’t a fluke. It is possible and common to have clinically trivial results that are nonetheless statistically significant. And it’s also possible for results to be technically statistically significant and yet still have a better chance of being a coincidence than you might think.
So imagine a treatment for pain that is supposedly “proven” to have an effect, but it’s a tiny effect. It only reduces pain slightly. You can take that data to the bank. It’s real! It’s statistically significant. Except there’s actually still a 5% chance that it’s a mistake. But that’s just enough to be “statistically significant.” Technically.
Now imagine that you understand this, and someone tells you the next day that this pain treatment is “proven”! Frustrating! It’s actually fairly common to see such “significant” results: clinically boring and possibly not even real.
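Here is a rough sketch of how a clinically boring effect can still earn the “significant” label. Every number in it is made up for illustration: a 1-point average improvement on a 100-point pain scale, a standard deviation of 20 points, and two trial sizes. The helper `two_group_p_value` is hypothetical and uses a normal approximation, not the exact test any particular paper ran.

```python
from math import sqrt
from statistics import NormalDist


def two_group_p_value(mean_diff, sd, n_per_group):
    """Two-sided p-value for a difference between two equal-sized groups.

    Normal approximation, assuming both groups share the same standard
    deviation. All inputs are illustrative, not taken from a real trial.
    """
    standard_error = sd * sqrt(2 / n_per_group)
    z = mean_diff / standard_error
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Same tiny effect (1 point out of 100), two very different sample sizes.
small_trial = two_group_p_value(mean_diff=1.0, sd=20.0, n_per_group=50)
large_trial = two_group_p_value(mean_diff=1.0, sd=20.0, n_per_group=5000)

print(f"50 per group:   p = {small_trial:.3f}")
print(f"5000 per group: p = {large_trial:.3f}")
```

The effect is identical in both cases. Only the sample size changed, and that alone is enough to push the p-value under 0.05. The result becomes “significant” without becoming any more useful to a patient.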
Just because a published paper presents a statistically significant result does not mean it necessarily has a biologically meaningful effect.
Science Left Behind: Feel-Good Fallacies and the Rise of the Anti-Scientific Left, Alex Berezow & Hank Campbell
If you torture data for long enough, it will confess to anything.
Ronald Harry Coase
The p-value is just a percentage. So P < 0.05 simply means “less than a 5% chance that the results were a coincidence.”
Head hurting already? Time for the stick people to take over this tutorial.
A “5% chance of coincidence” is actually a fairly strong chance of a coincidence. That’s one in twenty. That can happen. Something with a one in twenty chance of happening each day is going to happen more than once per month.
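The one-in-twenty arithmetic can be checked in a few lines. This is only a sketch: it assumes a 30-day month and that each day is independent of the others, and both function names are made up for the example.

```python
def expected_occurrences(chance_per_day, days):
    """Average number of times the event happens over the given days."""
    return chance_per_day * days


def chance_of_at_least_one(chance_per_day, days):
    """Probability the event happens on one or more days (days independent)."""
    return 1 - (1 - chance_per_day) ** days


ONE_IN_TWENTY = 1 / 20
DAYS_IN_MONTH = 30  # assumption for illustration

expected = expected_occurrences(ONE_IN_TWENTY, DAYS_IN_MONTH)
at_least_once = chance_of_at_least_one(ONE_IN_TWENTY, DAYS_IN_MONTH)

print(f"Expected occurrences in a month: {expected:.1f}")
print(f"Chance of at least one occurrence: {at_least_once:.0%}")
```

On average that works out to one and a half occurrences a month, which is where “more than once per month” comes from.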
Randall Munroe hides an extra little joke in all his comic strips (mouse over his comic strips on his website, wait a moment, and they are revealed). This time it was:
'So, uh, we did the green study again and got no link. It was probably a--' 'RESEARCH CONFLICTED ON GREEN JELLY BEAN/ACNE LINK; MORE STUDY RECOMMENDED!'
A great deal of crap science is presented in exactly this way. It’s one of the main ways that “studies show” a lot of things help pain that actually do no such thing. It’s one of the easiest ways that the “controversy” over many alternative treatments can be extended: by citing “significant” evidence of benefit, with data exactly as absurd as in the comic strip above.
It’s actually statistically normal for the occasional study to make a bad treatment look good … by freak chance.
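To see the freak chance in action, here is a small simulation sketch. It runs a couple of thousand imaginary trials of a treatment that does nothing at all (both groups are drawn from the same distribution) and counts how many come out “significant” anyway. The group size, number of studies, seed, and function names are all assumptions for illustration, and the p-value uses a normal approximation rather than an exact t-test.

```python
import random
from math import sqrt
from statistics import NormalDist, mean, stdev


def null_study_p_value(rng, n_per_group=50):
    """Simulate one trial of a useless treatment and return a two-sided p-value."""
    # Both groups come from the same distribution: there is no real effect.
    treated = [rng.gauss(0, 1) for _ in range(n_per_group)]
    control = [rng.gauss(0, 1) for _ in range(n_per_group)]
    standard_error = sqrt(
        stdev(treated) ** 2 / n_per_group + stdev(control) ** 2 / n_per_group
    )
    z = (mean(treated) - mean(control)) / standard_error
    return 2 * (1 - NormalDist().cdf(abs(z)))


def false_positive_rate(n_studies=2000, alpha=0.05, seed=1):
    """Fraction of no-effect studies that still reach p < alpha."""
    rng = random.Random(seed)
    hits = sum(null_study_p_value(rng) < alpha for _ in range(n_studies))
    return hits / n_studies


rate = false_positive_rate()
print(f"Useless treatment looked 'significant' in {rate:.1%} of studies")
```

The fraction should land somewhere near one in twenty, which is the point: with enough studies of a treatment that does nothing, a steady trickle of “positive” results is exactly what you would expect.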
A classic real world example of “statistically significant but clinically trivial” is the supposedly proven benefit of chiropractic adjustment: how much benefit, exactly? “Less than the threshold for what is clinically worthwhile,” as it turns out, according to Nefyn Williams, author of a 2011 paper in International Musculoskeletal Medicine.1 I have been pointing out that spinal adjustment benefits seem to be real-but-minor for many years now, and I explore that evidence in crazy detail in my low back pain and neck cricks tutorials. It’s a big topic, with lots of complexity.
Williams’ paper offers an oblique perspective that is quite different and noteworthy: his paper is not about chiropractic adjustment, but about the concept of clinical significance itself. There are various ways of measuring improvement in scientific tests of treatments, and, as Williams explains, “when an outcome measure improves by, say, five points it is not immediately apparent what this means.” How much improvement matters? After explaining and discussing various proposed standards and methods, Williams needed a good example to make his point. It’s quite interesting that he picked spinal manipulative therapy.
Chondroitin sulfate is a “nutraceutical,” a food-like nutritional supplement that is supposedly “good for cartilage” (because it is a major component of cartilage). It has been heavily studied, but there has never been any clear good scientific news about it, and it bombed a particularly large and good quality test in 2006 (see Clegg).
So it was a bit hard to believe my eyes when I read the summary of a 2011 experiment claiming that chondroitin sulfate “improves hand pain.”2 Really?!
No, not really. On a 100mm VAS (a pain scale, “visual analogue scale”), the treatment group was 8.77mm happier with their hands. With a p=.02 (a middlin’ P-value, neither high nor low). So basically what the researchers found is a chance that chondroitin makes a small difference in arthritis pain. It’s not nothing, but it is an incredibly unimpressive result — pretty much the definition of clinically insignificant. The authors’ interpretation is like taking the dog to the end of the driveway and saying you took him for a walk. Technically true ...
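For a feel for how much uncertainty hides behind those two numbers, here is a back-of-the-envelope sketch that works backwards from the reported difference (8.77mm) and p-value (.02) to a rough 95% confidence interval. It assumes a two-sided test with an approximately normal test statistic, which may not match the analysis the authors actually ran, so treat the output as a ballpark and not as the paper’s own interval. The function name is made up for the example.

```python
from statistics import NormalDist


def approx_ci_from_p(estimate, p_value, level=0.95):
    """Rough confidence interval recovered from an estimate and its p-value.

    Assumes a two-sided test whose statistic is approximately normal.
    """
    std_normal = NormalDist()
    # z-score that would produce the reported two-sided p-value
    z_observed = std_normal.inv_cdf(1 - p_value / 2)
    standard_error = abs(estimate) / z_observed
    # critical z for the requested confidence level (about 1.96 for 95%)
    z_critical = std_normal.inv_cdf(1 - (1 - level) / 2)
    margin = z_critical * standard_error
    return estimate - margin, estimate + margin


low, high = approx_ci_from_p(estimate=8.77, p_value=0.02)
print(f"Approximate 95% CI: {low:.1f}mm to {high:.1f}mm on a 100mm scale")
```

Under those assumptions the plausible range runs from barely above zero to a still-modest improvement, which fits the “technically true” reading of the result.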
So that is a lovely demonstration of the abuse of statistical significance!
The first significance “problem” is almost like a trick or truth bending. It works well and fools a lot of people — sometimes, I think, even the scientists or doctors who are using it — because the usage of the term “significant” is often technically correct and literally true, but obscures and diverts attention away from the whole story. Thus it is more like a subtle and technical lie of omission than an actual error.
That’s pretty bad! But it’s not the half of it. It gets much worse: a lot of so-called significant results aren’t even technically correct. Problem #2 is an actual error — something that would get you a failing grade in a basic statistics course.
a stark statistical error so widespread it appears in about half of all the published papers surveyed from the academic neuroscience research literature.
Dr. Steven Novella also wrote about it for ScienceBasedMedicine.org recently, adding that
there is no reason to believe that it is unique to neuroscience research or more common in neuroscience than in other areas of research.
And it is not. Dr. Christopher Moyer is a psychologist who studies massage therapy:
I have been talking about this error for years, and have even published a paper on it. I critiqued a single example of it, and then discussed how the problem was rampant in massage therapy research. Based on the Nieuwenhuis paper, apparently it’s rampant elsewhere as well, and that is really unfortunate. Knowing the difference between a within-group result and a between-groups result is basic stuff.
Clinical trials are all about comparing treatments. To be considered effective, a real treatment has to work better than a fake one — a placebo. A drug must produce better results than a sugar pill. If that difference is big enough, it is “statistically significant.” There are a lot of details, but that’s the beating heart of a fair scientific test: can the treatment beat a fake?
What you can’t do is just compare the treatment to nothing at all and say, “See, it works: huge difference! Huge improvement over nothing!” The problem is that both effective medicines and placebos can beat nothing — it just doesn’t mean much until you know the treatment can also trounce a placebo. On its own, a “statistically significant” difference between treatment and nothing at all is the sound of one hand clapping. A meaningful comparison has to be a statistical ménage à trois, comparing all three to each other (analysis of variance, or ANOVA).
The error is the failure to do this. And, shockingly, Nieuwenhuis et al reported that more than half of researchers were making this mistake: comparing treatments and placebos to nothing, but not to each other.
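The difference between a within-group result and a between-groups result is easier to see with numbers. The sketch below uses invented summary statistics: a treatment group and a placebo group of 50 patients each, average improvements of 20 and 17 points, and a standard deviation of 15 in both. The function names are hypothetical and the p-values use a normal approximation, so this is an illustration of the logic and not a substitute for a proper analysis.

```python
from math import sqrt
from statistics import NormalDist


def two_sided_p(z):
    """Two-sided p-value for a z-score under the standard normal."""
    return 2 * (1 - NormalDist().cdf(abs(z)))


def within_group_p(mean_change, sd, n):
    """Did this one group improve compared with its own baseline?"""
    return two_sided_p(mean_change / (sd / sqrt(n)))


def between_groups_p(mean_a, sd_a, n_a, mean_b, sd_b, n_b):
    """Did group A improve more than group B? This is the comparison that matters."""
    standard_error = sqrt(sd_a ** 2 / n_a + sd_b ** 2 / n_b)
    return two_sided_p((mean_a - mean_b) / standard_error)


# Invented numbers: average improvement in pain score, SD, group size.
treatment_vs_baseline = within_group_p(mean_change=20, sd=15, n=50)
placebo_vs_baseline = within_group_p(mean_change=17, sd=15, n=50)
treatment_vs_placebo = between_groups_p(20, 15, 50, 17, 15, 50)

print(f"Treatment vs. its own baseline: p = {treatment_vs_baseline:.4f}")
print(f"Placebo vs. its own baseline:   p = {placebo_vs_baseline:.4f}")
print(f"Treatment vs. placebo:          p = {treatment_vs_placebo:.4f}")
```

Both groups show a wildly “significant” improvement over where they started, yet the treatment does not significantly beat the placebo. Reporting only the first line is the error.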
Studies of massage therapy (and others, like chiropractic4) are particularly plagued by this error. Why? Because massage is so much “better than nothing.” The size of that difference looms large, and so it’s all too easy to mistake it for the one that matters — and fail to even compare the treatment to a placebo. And it’s really hard to come up with a meaningful placebo. It’s notoriously difficult to give a patient a fake massage. (They catch on.)
Statistics does not care about these difficulties: you still can’t compare massage to nothing, stop there, and call the difference “significant.” You still have to do your ANOVA, and massage still has to beat some kind of placebo before it can be considered more effective than pleasant grooming.
This research problem is not limited to massage, but massage is probably the single best example of it. It crops up when you’re studying any treatment that involves a lot of interaction. The more interaction, the worse the problem gets. It’s a big deal in massage research because massage involves a lot of interaction, much of which is pleasant and emotionally engaging. Interaction with a friendly health care provider has a lot of surprisingly potent effects: people react strongly and positively to compassion, attention, and touch. The problem is that those benefits have nothing to do with any specific “active ingredient” in a massage. Grooming is just nice. It’s like pizza: even when it’s bad, it’s pretty good.
Much of the good done by therapists of all kinds is attributable to potent placebos driven by their complex interactions with patients, and not by anything in particular that they are doing to the patient. To find out how well a therapy works, it must be compared to sham treatments which are as much like the treatment as possible. This is hard to do, and it has rarely been done well. It’s much more typical to compare therapy to something too lifeless and “easy to beat,” too much like comparing it to nothing at all instead of a real placebo. And there’s the difference error: comparison to the wrong thing, statistical significance of the wrong difference.