Statistical Significance Abuse

A lot of research makes scientific evidence seem more “significant” than it is

2,300 words, published 2011, updated May 18th, 2014
by Paul Ingraham, Vancouver, Canada
I am a science writer, the Assistant Editor of Science-Based Medicine, and a former Registered Massage Therapist with a decade of experience treating tough pain cases. I’ve written hundreds of articles and several books, and I’m known for sassy, skeptical, referenced analysis and a huge bibliography. I am a runner and ultimate player, and live in beautiful downtown Vancouver, Canada.

Summary: Many study results are called “statistically significant,” giving unwary readers the impression of good news. But it’s misleading: statistical significance means only that the measured effect of a treatment is probably real (not a fluke). It says nothing about how large the effect is. Many small effect sizes are reported only as “statistically significant”: it’s a nearly standard way for biased researchers to make it sound like they found something more important than they did.

This article is about two common problems with “statistical significance” in medical research. Both problems are particularly rampant in the study of massage therapy, chiropractic and alternative medicine in general, and are wonderful examples of why science is hard and “why most published research findings are false”:

  1. confusing statistical with clinical significance
  2. reporting statistical significance of the wrong thing

Stats are hard and scary, of course — everyone knows that. But there will be comic strips, funny videos, and diagrams. I’ll try to make it worth your while.

Video: “60% of the time, it works every time” (0:09)

Significance Problem #1

Two flavours of “significant”: statistical versus clinical

Many scientific papers conclude that their findings are “significant” without actually reporting their numbers. Almost any reader but a statistician will be fooled, because the word “significant” sounds so much like it means “a big deal.” But statistical significance is a technical thing, meaning only that the result of the study probably isn’t a fluke. It is possible and common to have clinically trivial results that are nonetheless statistically significant. And it’s also possible for results to be technically statistically significant and yet still have a better chance of being a coincidence than you might think.

So imagine a treatment for pain that is supposedly “proven” to have an effect, but it’s a tiny effect. It only reduces pain slightly. You can take that data to the bank. It’s real! It’s statistically significant. Except there’s actually still a 5% chance that it’s a mistake.1 But that’s just enough to be “statistically significant.” Technically.

Now imagine that you understand this, and someone tells you the next day that this pain treatment is “proven”! Frustrating! It’s actually fairly common to see such “significant” results: clinically boring and possibly not even real.
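Here is a minimal sketch of how a clinically boring result can still be emphatically “statistically significant.” Every number in it is made up for illustration: a hypothetical trial with 10,000 patients per group, pain scored out of 100, and a true benefit of only 2 points. The test is a simple normal-approximation comparison of two means, not any particular study’s method.

```python
import math
import random
from statistics import NormalDist, mean, stdev


def two_sample_z(a, b):
    """Return (difference in means, two-sided p-value), normal approximation."""
    diff = mean(a) - mean(b)
    se = math.sqrt(stdev(a) ** 2 / len(a) + stdev(b) ** 2 / len(b))
    z = diff / se
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return diff, p


random.seed(1)
n = 10_000  # hypothetical number of patients per group

# Hypothetical pain scores out of 100. The true benefit is only 2 points.
placebo = [random.gauss(50, 20) for _ in range(n)]
treated = [random.gauss(48, 20) for _ in range(n)]

diff, p = two_sample_z(placebo, treated)
print(f"pain reduction: {diff:.1f} points out of 100, p = {p:.2g}")
```

With a sample that large, the p-value comes out far below 0.05 even though the benefit is a couple of points out of 100. The p-value tells you the difference is probably not a fluke. It does not tell you that anyone would notice it.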

Just because a published paper presents a statistically significant result does not mean it necessarily has a biologically meaningful effect.

Science Left Behind: Feel-Good Fallacies and the Rise of the Anti-Scientific Left, Alex Berezow & Hank Campbell

If you torture data for long enough, it will confess to anything.

Ronald Harry Coase

It’s all about P for “percentage”

Statistical significance is measured with the infamous, cryptic “p-value.” P-value illiteracy is epidemic, even though it’s one of the most important numbers in all of statistics and research. If you’re ever going to try to make sense of any scientific research, you need to grok (master) p-values.

The p-value is just a percentage. So P < 0.05 simply means “less than a 5% chance that the results were a coincidence.” Sort of. More formally stated, a low p-value means it’s unlikely you’d get those results if there weren’t a real effect to be found.2
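One way to make the formal definition concrete is a permutation test, sketched below with invented pain scores for two tiny groups. The idea: if the treatment does nothing, the group labels are arbitrary, so shuffle them many times and count how often chance alone produces a difference at least as big as the one observed. That fraction is an estimate of the p-value. The function name and the data are illustrative assumptions, not taken from any study.

```python
import random
from statistics import mean


def permutation_p_value(treated, control, n_shuffles=10_000, seed=0):
    """Estimate a two-sided p-value by shuffling group labels."""
    rng = random.Random(seed)
    observed = mean(control) - mean(treated)
    pooled = list(treated) + list(control)
    n_treated = len(treated)
    at_least_as_extreme = 0
    for _ in range(n_shuffles):
        rng.shuffle(pooled)  # pretend the labels were assigned at random
        fake_treated = pooled[:n_treated]
        fake_control = pooled[n_treated:]
        diff = mean(fake_control) - mean(fake_treated)
        if abs(diff) >= abs(observed):
            at_least_as_extreme += 1
    return at_least_as_extreme / n_shuffles


# Made-up pain scores out of 10 (lower is better)
treated = [3, 4, 2, 5, 3, 4, 2, 3]
control = [5, 6, 4, 5, 6, 4, 5, 7]

p_value = permutation_p_value(treated, control)
print(f"estimated p-value: {p_value:.4f}")
```

Note what the number answers: “how often would shuffled, meaningless labels produce a gap this big?” It does not answer “how likely is it that the treatment works?”, which is the question most readers think it answers.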

Head hurting already? Time for the stick people to take over this tutorial.

Learn research statistics from stick people!

Geeky web comic xkcd illustrated the trouble with statistical significance better than I can with my paltry words.


xkcd #882 © by Randall Munroe



Hard-hitting comic analysis

A “5% chance of coincidence” is actually a fairly strong chance of a coincidence. That’s one in twenty. That can happen. Something with a one in twenty chance of happening each day will happen, on average, about one and a half times a month.

Roll a 20-sided die (that’s a Dungeons & Dragons thing) and you’ll notice that any given number comes up pretty often!

Randall Munroe hides an extra little joke in all his comic strips (mouse over his comic strips on his website, wait a moment, and they are revealed). This time it was:

'So, uh, we did the green study again and got no link. It was probably a--' 'RESEARCH CONFLICTED ON GREEN JELLY BEAN/ACNE LINK; MORE STUDY RECOMMENDED!'

A great deal of crap science is presented in exactly this way. It’s one of the main ways that “studies show” a lot of things help pain that actually do no such thing. It’s one of the easiest ways to extend the “controversy” over many alternative treatments: by citing “significant” evidence of benefit, with data exactly as absurd as in the comic strip above.

It’s actually statistically normal for the occasional study to make a bad treatment look good … by freak chance.
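The arithmetic behind the jelly bean joke can be sketched in a few lines. Assume twenty independent tests of colours that truly do nothing, each judged at P < 0.05. The chance that at least one comes up “significant” is 1 − 0.95^20, or roughly 64%. The simulation below checks that figure by drawing p-values at random, which is a fair stand-in because p-values are uniformly distributed between 0 and 1 when there is no real effect. The independence of the twenty tests is an assumption of this sketch.

```python
import random

ALPHA = 0.05     # the usual significance threshold
N_COLOURS = 20   # jelly bean colours tested, none with any real effect

# Chance that at least one of twenty null tests is "significant"
analytic = 1 - (1 - ALPHA) ** N_COLOURS


def any_false_positive(rng):
    """One round of twenty null studies: did any of them hit p < ALPHA?"""
    return any(rng.random() < ALPHA for _ in range(N_COLOURS))


rng = random.Random(882)
n_rounds = 100_000
simulated = sum(any_false_positive(rng) for _ in range(n_rounds)) / n_rounds

print(f"analytic:  {analytic:.3f}")
print(f"simulated: {simulated:.3f}")
```

So a green jelly bean headline is not a rare accident. With twenty shots on goal, it is the expected outcome about two times out of three.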

A general example of significance problem #1: chiropractic spinal adjustment

A classic real world example of “statistically significant but clinically trivial” is the supposedly proven benefit of chiropractic adjustment: how much benefit, exactly? “Less than the threshold for what is clinically worthwhile,” as it turns out, according to Nefyn Williams, author of a 2011 paper in International Musculoskeletal Medicine.3 I have been pointing out that spinal adjustment benefits seem to be real-but-minor for many years now, and I explore that evidence in crazy detail in my low back pain and neck pain tutorials. It’s a big topic, with lots of complexity.

Williams’ paper offers an oblique perspective that is quite different and noteworthy: his paper is not about chiropractic adjustment, but about the concept of clinical significance itself. There are various ways of measuring improvement in scientific tests of treatments, and, as Williams explains, “when an outcome measure improves by, say, five points it is not immediately apparent what this means.” How much improvement matters? After explaining and discussing various proposed standards and methods, Williams needed a good example to make his point. It’s quite interesting that he picked spinal manipulative therapy.

A specific example of significance problem #1: chondroitin sulfate

Chondroitin sulfate is a “nutraceutical,” a food-like nutritional supplement that is supposedly “good for cartilage” (because it is a major component of cartilage). It has been heavily studied, but there has never been any clear good scientific news about it, and it bombed a particularly large and good quality test in 2006 (see Clegg).

So it was a bit hard to believe my eyes when I read the summary of a 2011 experiment claiming that chondroitin sulfate “improves hand pain.”4 Really?!

No, not really. On a 100mm VAS (a pain scale, “visual analogue scale”), the treatment group was 8.77mm happier with their hands, with p=.02 (a middlin’ P-value, neither high nor low). So basically what the researchers found is a chance that chondroitin makes a small difference in arthritis pain. It’s not nothing, but it is an incredibly unimpressive result, pretty much the definition of clinically insignificant. The authors’ interpretation is like taking the dog to the end of the driveway and saying you took him for a walk. Technically true …

So that is a lovely demonstration of the abuse of statistical significance!
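For a rough sense of how uncertain that 8.77mm is, you can back out an approximate confidence interval from the two numbers quoted above. This is a back-of-envelope sketch only. It assumes the p-value is two-sided, is exactly 0.02, and came from a test that behaves like a normal (z) test. If the paper reports its own interval, that one should be preferred. The function name is mine.

```python
from statistics import NormalDist


def ci_from_p(estimate, p_value, level=0.95):
    """Approximate confidence interval from an estimate and a two-sided p-value.

    Assumes a normal (z) test, which is an assumption of this sketch.
    """
    std_normal = NormalDist()
    z_observed = std_normal.inv_cdf(1 - p_value / 2)   # z that yields this p
    standard_error = abs(estimate) / z_observed
    z_level = std_normal.inv_cdf(1 - (1 - level) / 2)  # about 1.96 for 95%
    margin = z_level * standard_error
    return estimate - margin, estimate + margin


# 8.77 mm difference on a 100 mm scale, p = 0.02 (figures quoted above)
low, high = ci_from_p(8.77, 0.02)
print(f"approximate 95% CI: {low:.1f} mm to {high:.1f} mm")
```

Under those assumptions the interval runs from a little over 1mm to about 16mm on a 100mm scale. In other words, the data are compatible with an effect that is close to nothing, and even the optimistic end is modest.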

Significance Problem #2

Statistical significance of the wrong comparison (no analysis of variance)

The first significance “problem” is almost like a trick or truth bending. It works well and fools a lot of people — sometimes, I think, even the scientists or doctors who are using it — because the usage of the term “significant” is often technically correct and literally true, but obscures and diverts attention away from the whole story. Thus it is more like a subtle and technical lie of omission than an actual error.

That’s pretty bad! But it’s not the half of it. It gets much worse: a lot of so-called significant results aren’t even technically correct. Problem #2 is an actual error — something that would get you a failing grade in a basic statistics course.

A widespread, stark statistical error

This bomb comes from a recent analysis of neuroscience research.5 A number of writers have reported on this already. It was described by Ben Goldacre for The Guardian as

a stark statistical error so widespread it appears in about half of all the published papers surveyed from the academic neuroscience research literature.

Dr. Steven Novella also wrote about it recently, adding that

there is no reason to believe that it is unique to neuroscience research or more common in neuroscience than in other areas of research.

And it is not. Dr. Christopher Moyer is a psychologist who studies massage therapy:

I have been talking about this error for years, and have even published a paper on it. I critiqued a single example of it, and then discussed how the problem was rampant in massage therapy research. Based on the Nieuwenhuis paper, apparently it’s rampant elsewhere as well, and that is really unfortunate. Knowing the difference between a within-group result and a between-groups result is basic stuff.

A significantly irrelevant significant difference

Clinical trials are all about comparing treatments. To be considered effective, a real treatment has to work better than a fake one — a placebo. A drug must produce better results than a sugar pill. If that difference is big enough, it is “statistically significant.” There are a lot of details, but that’s the beating heart of a fair scientific test: can the treatment beat a fake?

What you can’t do is just compare the treatment to nothing at all and say, “See, it works: huge difference! Huge improvement over nothing!” The problem is that both effective medicines and placebos can beat nothing — it just doesn’t mean much until you know the treatment can also trounce a placebo. On its own, a “statistically significant” difference between treatment and nothing at all is the sound of one hand clapping. A meaningful comparison has to be a statistical ménage à trois, comparing all three to each other (analysis of variance, or ANOVA).

The error is the failure to do this. And, shockingly, Nieuwenhuis et al reported that more than half of researchers were making this mistake: comparing treatments and placebos to nothing, but not to each other.
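The difference between a within-group result and a between-groups result is easier to see with numbers. The sketch below uses invented pain-reduction scores for ten patients who got massage and ten who got a sham treatment. It tests each group against “no change,” and then tests the two groups against each other. The p-values use a normal approximation, which is crude for samples this small. A real analysis would use a t-test or ANOVA, but the pattern is the same.

```python
import math
from statistics import NormalDist, mean, stdev


def p_two_sided(z):
    """Two-sided p-value for a z statistic (normal approximation)."""
    return 2 * (1 - NormalDist().cdf(abs(z)))


def within_group_p(changes):
    """Did this group change at all? Compares its mean change to zero."""
    z = mean(changes) / (stdev(changes) / math.sqrt(len(changes)))
    return p_two_sided(z)


def between_groups_p(a, b):
    """Did one group change more than the other? The comparison that matters."""
    se = math.sqrt(stdev(a) ** 2 / len(a) + stdev(b) ** 2 / len(b))
    return p_two_sided((mean(a) - mean(b)) / se)


# Made-up pain reductions on a 100-point scale (positive means improvement)
massage = [14, 6, 20, -1, 10, 17, 3, 12, 9, 16]
sham = [12, 5, 18, -3, 9, 14, 2, 11, 7, 15]

p_massage = within_group_p(massage)
p_sham = within_group_p(sham)
p_between = between_groups_p(massage, sham)

print(f"massage vs. no change: p = {p_massage:.2g}")
print(f"sham vs. no change:    p = {p_sham:.2g}")
print(f"massage vs. sham:      p = {p_between:.2g}")
```

Both groups improve “significantly” compared to nothing, but the difference between them is nowhere near significant. Reporting only the first line would make massage look proven. The third line is the one that answers the actual question.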


Massage is better than nothing! Too much better …

Studies of massage therapy (and others, like chiropractic6) are particularly plagued by this error. Why? Because massage is so much “better than nothing.” The size of that difference looms large, and so it’s all too easy to mistake it for the one that matters — and fail to even compare the treatment to a placebo. And it’s really hard to come up with a meaningful placebo. It’s notoriously difficult to give a patient a fake massage. (They catch on.)

Statistics does not care about these difficulties: you still can’t compare massage to nothing, stop there, and call the difference “significant.” You still have to do your ANOVA, and massage still has to beat some kind of placebo before it can be considered more effective than pleasant grooming.

This research problem is not limited to massage, but massage is probably the single best example of it. It crops up when you’re studying any treatment that involves a lot of interaction. The more interaction, the worse the problem gets. It’s a big deal in massage research because massage involves a lot of interaction, much of which is pleasant and emotionally engaging. Interaction with a friendly health care provider has a lot of surprisingly potent effects: people react strongly and positively to compassion, attention, and touch. The problem is that those benefits have nothing to do with any specific “active ingredient” in a massage. Grooming is just nice. It’s like pizza: even when it’s bad, it’s pretty good.

Much of the good done by therapists of all kinds is attributable to potent placebos driven by their complex interactions with patients, and not by anything in particular that they are doing to the patient. To find out how well a therapy works, it must be compared to sham treatments which are as much like the treatment as possible. This is hard to do, and it has rarely been done well. It’s much more typical to compare therapy to something too lifeless and “easy to beat,” too much like comparing it to nothing at all instead of a real placebo. And there’s the difference error: comparison to the wrong thing, statistical significance of the wrong difference.

About Paul Ingraham

I am a science writer, former massage therapist, and assistant editor of Science-Based Medicine. I have had my share of injuries and pain challenges as a runner and ultimate player. My wife and I live in downtown Vancouver, Canada. See my full bio and qualifications, or my blog, Writerly. You might run into me on Facebook and Google, but mostly Twitter.


  1. This is not exactly correct, as anyone very familiar with p-values and stats will know. It’s just a plain language representation of some pretty hairy math. I’ll be (slightly) more precise in the next section.
  2. And even more formally stated (the most correct way to say it), a low p-value means that the results would be unlikely specifically if the null hypothesis were true. The null hypothesis is basically the bet that there’s actually no effect for the experiment to find. If that bet is right, then you’re only going to observe the appearance of a real effect due to a fluke or error. The null hypothesis is a difficult but really interesting idea, and I cover it in some detail in Why So “Negative”? Answering accusations of negativity, and my reasons and methods for debunking bad treatment options for pain and injury.
  3. Williams. How important is the ‘minimal clinically important change’? International Musculoskeletal Medicine. 2011.
  4. Gabay et al. Symptomatic effect of chondroitin sulfate in hand osteoarthritis: the finger osteoarthritis chondroitin treatment study (FACTS). Arthritis Rheum. 2011. PubMed #21898340.
  5. Nieuwenhuis et al. Erroneous analyses of interactions in neuroscience: a problem of significance. Nat Neurosci. 2011. PubMed #21878926.
  6. This is a problem studying any kind of therapy where patients spend more time with health care professionals. The more time and interaction, the worse the problem. Thus, massage is by far the most difficult of them all, but it’s also a problem studying chiropractic, physical therapy, psychotherapy, osteopathy, and so on. Basically, the higher the potential for a robust placebo, the harder it gets to avoid making this research mistake.