## Why P-values Cannot Tell You If a Hypothesis Is Correct 124 124

ananyo writes

*"P values, the 'gold standard' of statistical validity, are not as reliable as many scientists assume. Critically, they cannot tell you the odds that a hypothesis is correct. A feature in**Nature*looks at why, if a result looks too good to be true, it probably is, despite an impressive-seeming P value."
## Oblig XKCD (Score:5, Informative)

http://xkcd.com/882/ [xkcd.com]

Even the example of p=0.01 from the article is subject to the same problem. That's why the LHC worked for something like 6 sigma before declaring the higgs boson to be discovered. Even then, there's always the chance, however remote, that statistics fooled them.

## Re:Oblig XKCD (Score:5, Informative)

While I agree with the article's headline/conclusion - They aren't innocent of playing games themselves:

Take their sentence: "meeting online nudged the divorce rate from 7.67% down to 5.96%, and barely budged happiness from 5.48 to 5.64 on a 7-point scale" ... Isn't that intentionally misleading? Sure, 0.16 points doesn't sound like much... but it's on a seven point scale. If we change that to a 3 point scale it's only 0.06 points! Amazingly small! ... but wait, if I change that to a 900,000 point scale, well, then that's a whole 20,571 points difference. HUGE NUMBERS!

But I think they missed a really important point - SPSS (one of the very popular data analysis packages) offers you a huge range of correlation tests, and you are _supposed_ to choose to best match the data. Each has their own assumptions, and will only provide the correct 'p' value if the data matches those assumptions.

For example, Many of the tests require that the data follow a bell-shaped curve, and you are supposed to first test your data to ensure that it is normally distributed before using any of the correlation tests that assume normally distributed data. If you don't, you risk over-stating the correlation.

If you have data from a likert scale, you should treat it as ordinal (ranked) data, not numerical (ie. the difference between "Totally Disagree" and "somewhat disagree" should not be assumed to be the same as the difference between "somewhat disagree" and " totally agree") - however, if you aren't getting to the magic p0.5 treating it as ordinal data, you can usually get it over the line by treating it as numerical data and running a different correlation test.

Lecturers are measured on how many papers they publish, most peer reviewers don't know the subtle differences between these tests, so as long as they see 'SPSS said p0.5' and they don't disagree with any of the content of your paper, yay, you get published.

Finally, many of the tests have a minimum sample size that should ever be analysed. If you only have a study of 300 people, there's a whole range of popular correlation tests that you are not supposed to use. But you do, because SPSS makes it easy, because it gets better results, because you forgot what the minimum size was and can't be arsed looking it up (if it's a real problem the reviewers will point it out).

(Evidence to support these statements can be found in the "Survey Researcher's SPSS Cookbook" by Mark Manning and Don Munro. Obviously, it doesn't go into how you can choose an incorrect test to 'hack the p value', to prove that I recommend you download a copy of SPSS and take a short-term position as a lecturer's assistant)

## Torturing the data (Score:4, Informative)

One variant of "p-hacking" is "torturing the data", or performing the same statistical test over and over again, on slightly different data sets, until you get the result that you want. You

willeventually get the result you want, regardless of the underlying reality, because there is 1 spurious result for every 20 statistical tests you perform (p=0.05).I remember one amusing example, which involved a researcher who claimed that a positive mental outlook increases cancer survival times. He had a poorly-controlled study demonstrating that people who keep their "mood up" are more likely to survive longer if they have cancer. When other researchers designed a larger, high-quality study to examine this phenomenon, it found no effect. Mood made no difference to survival time.

Then something interesting happened. The original researcher responded by

looking for subsets of the datafrom the large study, to find any sub-groups where his hypothesis would be confirmed. He ended up retorting that "keeping a positive mental outlook DID work, according to your own data, for 35-45 year-old east asian females (peven if the p value was 0.05.This kind of thing crops up all the time.

## Re:And this is why (Score:4, Informative)

Other way around, and not quite true for a properly formulated hypothesis.

Frequentist statistics involves making a statistical hypothesis, choosing a level of evidence that you find acceptable (usually alpha=0.05) and using that to accept or reject it. The statistical hypothesis is tied to your scientific hypothesis (or it should be). If the standard of evidence is met or exceeded, the results support the hypothesis. If not, they don't mean anything.

HOWEVER, if you specify your hypothesis well, you include a

minimumdifference that you consider meaningful. You then calculate error bars for your result and, if they show that your measured value is less than the minimum you hypothesized, that's evidence supporting the negative (not the null) hypothesis: any difference is so small as to be meaningless.I am not a fan of everyone using Bayesian techniques. A Bayesian analysis of a single experiment that gives a normally distributed measurement (which is most of them) with a non-informative prior is generally equivalent to a frequentist analysis. Since scientists already have trouble doing simple frequentist tests correctly, they do not need to be doing needless Bayesian analyses.

As for informative priors, I don't think they should ever be used in the report of a single experiment. Report the p-value and the Bayes factor, or the equivalent information needed to calculate them. Since an informative prior is inherently subjective, the reader should be left to make up his own mind what it is. Reporting the Bayes factor makes meta-analyses, where Bayesian stats SHOULD be mandatory, easier.