Education Science

Scientists Propose To Raise the Standards For Statistical Significance In Research Studies (sciencemag.org) 137

sciencehabit shares a report from Science Magazine: A megateam of reproducibility-minded scientists is renewing a controversial proposal to raise the standard for statistical significance in research studies. They want researchers to dump the long-standing use of a probability value (p-value) of less than 0.05 as the gold standard for significant results, and replace it with the much stiffer p-value threshold of 0.005. Backers of the change, which has been floated before, say it could dramatically reduce the reporting of false-positive results -- studies that claim to find an effect when there is none -- and so make more studies reproducible. And they note that researchers in some fields, including genome analysis, have already made a similar switch with beneficial results.

"If we're going to be in a world where the research community expects some strict cutoff ... it's better that that threshold be .005 than .05. That's an improvement over the status quo," says behavioral economist Daniel Benjamin of the University of Southern California in Los Angeles, first author on the new paper, which was posted 22 July as a preprint article on PsyArXiv and is slated for an upcoming issue of Nature Human Behavior. "It seemed like this was something that was doable and easy, and had worked in other fields."

  • Six-sigma! (Score:4, Funny)

    by msauve ( 701917 ) on Wednesday July 26, 2017 @08:37PM (#54888131)
Make it Six Sigma, which is really 4.5 sigma once you account for the 1.5-sigma shift they allow as a fudge factor, because a true 6 sigma isn't realistic.
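    For anyone who wants to check that arithmetic, here is a minimal sketch (my own illustration, assuming the usual one-sided normal-tail convention and SciPy available) of what the 1.5-sigma shift does to the defect rate:

    ```python
    # Hypothetical sketch: defect rates at a true 6 sigma versus the shifted 4.5 sigma.
    # Assumes the one-sided tail convention behind the usual "3.4 defects per million" figure.
    from scipy.stats import norm

    for sigma_level in (6.0, 4.5):
        tail = norm.sf(sigma_level)          # one-sided upper-tail probability
        print(f"{sigma_level} sigma -> {tail:.2e} "
              f"(~{tail * 1e6:.3f} defects per million opportunities)")
    # A true 6 sigma is roughly 0.001 per million; the shifted 4.5 sigma is ~3.4 per million.
    ```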
    • Re:Six-sigma! (Score:5, Insightful)

      by ShanghaiBill ( 739463 ) on Wednesday July 26, 2017 @09:39PM (#54888399)

      Make it Six Sigma

      That would eliminate many false positives, as well as eliminating nearly all true positives. Of course, this will do nothing to reduce flawed studies caused by reasons other than statistics, such as non-representative sampling (e.g.: most mouse studies use only male mice), poor experiment design, shoddy data gathering, sponsorship bias, and outright fraud.

      But, the cost of clinical studies would only increase by an order of magnitude, so what do we have to lose?

  • by Bueller_007 ( 535588 ) on Wednesday July 26, 2017 @09:15PM (#54888265)

    There's a trade-off between sensitivity and specificity. If you increase the threshold for "significance", you reduce the power to discover a significant effect when it truly does exist.

    And a major part of the problem with scientific studies is that they are already underpowered. According to conventional wisdom, scientists should ideally strive for a power of about 80% (i.e., an 80% chance of detecting an effect if it truly exists), but very few studies actually achieve that level. In many fields, typical power is below 50%, and sometimes much lower.

    Underpowered studies result in two major problems:
    1) Most obviously, an underpowered study produces more FALSE NEGATIVES: you fail to find a true effect. Either you publish your incorrect result of no effect (and why should we consider published false positives to be any worse than published false negatives?), or you don't publish the study at all because you couldn't reach significance, which exacerbates the "file-drawer effect" and wastes the research dollars that went into it.
    2) Somewhat counterintuitively, underpowered studies are often also more likely to produce FALSE POSITIVES. This is because, when your power to detect a true effect is low and you test a large number of effects that are likely to be null, most of the hypotheses you declare "significantly" non-null will actually be false positives. We would say that the "false discovery rate" tends to be very high when the power is low (a quick numerical sketch of this follows at the end of this comment).

    Reducing the level of significance will do little to address these problems, and in some cases may even exacerbate the problem.

    The key is *to move away from the binary concept of "significance" altogether*. It's obviously artificial to have an arbitrary numerical cutoff for "matters" vs. "doesn't matter", and this is not what Ronald Fisher intended when he popularized the p-value and developed the concept of "significance".

    What we should be doing is measuring and reporting effect sizes along with their credible intervals, while using priors that are based on our real state of knowledge. In other words, we should be doing Bayesian statistics.
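    To put a number on point (2): a minimal sketch (assuming independent tests and a fixed fraction of truly null hypotheses; the 90% figure below is purely illustrative) of how the false discovery rate depends on power.

    ```python
    # Illustrative calculation (not from the paper): false discovery rate as a
    # function of statistical power, assuming a fraction pi0 of tested
    # hypotheses are truly null and a significance threshold alpha.
    def false_discovery_rate(alpha: float, power: float, pi0: float) -> float:
        false_pos = alpha * pi0              # expected rate of false positives
        true_pos = power * (1.0 - pi0)       # expected rate of true positives
        return false_pos / (false_pos + true_pos)

    for power in (0.8, 0.5, 0.2):
        fdr = false_discovery_rate(alpha=0.05, power=power, pi0=0.9)
        print(f"power={power:.1f}: ~{100 * fdr:.0f}% of 'significant' findings are false")
    # With 90% of tested effects truly null, dropping power from 0.8 to 0.2 takes the
    # share of "discoveries" that are false positives from roughly a third to two thirds.
    ```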

    • A large number of results that are likely to be null, I mean.

    • I think you answered your own question in (1) - Negative results rarely get published, and so false negatives rarely propagate beyond the researchers involved. False positives are more of a problem specifically because they are far more likely to misguide others.

      Now, the fact that negative results rarely get published is a whole different problem...

      As for (2) - I guess I just don't see how it's counterintuitive - isn't the entire problem with an underpowered study the fact that the real results are likely

    • by Compuser ( 14899 )

      Bayesian statistics is all well and good, but prior selection is an art. Most biologists have trouble with the frequentist approach (beyond whatever some software tool produces automagically). Do you think they will suddenly be able to specify a proper prior? I would guess that 99% of papers will use Jeffreys (uninformative) priors, and the vast majority of the remainder will be junk, violating, oh I don't know, causality? Good luck with that proposal.
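      As a concrete (entirely hypothetical) illustration of how much the prior matters at small sample sizes, here is a conjugate Beta-Binomial sketch comparing a Jeffreys prior with an informative one; the counts and prior parameters are made up.

      ```python
      # Hypothetical sketch: posterior for a success probability after observing
      # 7 successes in 10 trials, under a Jeffreys prior Beta(0.5, 0.5) versus an
      # informative prior Beta(20, 80) encoding a strong belief that p is near 0.2.
      from scipy.stats import beta

      successes, failures = 7, 3

      priors = {
          "Jeffreys Beta(0.5, 0.5)": (0.5, 0.5),
          "informative Beta(20, 80)": (20.0, 80.0),
      }

      for name, (a, b) in priors.items():
          post = beta(a + successes, b + failures)    # conjugate update
          lo, hi = post.ppf([0.025, 0.975])           # 95% credible interval
          print(f"{name}: posterior mean {post.mean():.2f}, "
                f"95% CI ({lo:.2f}, {hi:.2f})")
      # With only 10 observations the two priors give noticeably different answers;
      # with thousands of observations they would converge.
      ```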

    • According to conventional wisdom, ideally, scientists should strive for a power of about 80% (i.e., an 80% chance of detecting an effect if it truly exists), but very few studies actually achieve power of this level. In many fields, the power is less than 50% and sometimes much less.

      How does one control this in practice? Let's say I want to compare means. Do I need to assume the difference is greater than some X in order to lower-bound my power (modulo a bunch of assumptions)? I don't really see this done in practice.
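      In practice, yes: you posit a minimum effect size of interest and then pick the sample size that reaches the target power under that assumption. A minimal simulation sketch (my own illustration, assuming normal data and a two-sample t-test; the 0.5-SD effect size is arbitrary):

      ```python
      # Hypothetical sketch: estimate the power of a two-sample t-test by simulation,
      # assuming the smallest effect we care about is 0.5 standard deviations.
      import numpy as np
      from scipy.stats import ttest_ind

      rng = np.random.default_rng(0)
      effect_size = 0.5      # assumed minimum difference in means, in SD units
      alpha = 0.05
      n_sims = 5000

      for n_per_group in (30, 64, 100):
          hits = 0
          for _ in range(n_sims):
              a = rng.normal(0.0, 1.0, n_per_group)
              b = rng.normal(effect_size, 1.0, n_per_group)
              if ttest_ind(a, b).pvalue < alpha:
                  hits += 1
          print(f"n={n_per_group} per group: estimated power ~ {hits / n_sims:.2f}")
      # Around n = 64 per group the power reaches ~0.80 for a 0.5-SD effect at alpha = 0.05;
      # tightening alpha to 0.005 would push the required n up substantially.
      ```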

  • Or is this another attempt to discredit climate-change science by setting an impossible bar for proof? I honestly don't know; this is the first I've heard of it. Still, it's hard to imagine it being controversial otherwise.
    • I'm not sure what climate science has to do with this. It was published in a psychology journal, which is firmly in the realm of the social/biological sciences. Climate scientists have their own basket of statistical issues to deal with.

  • by Anonymous Coward on Wednesday July 26, 2017 @09:19PM (#54888287)

    I'm not convinced this will help. There are a couple of issues here. Often the experimental design can be tweaked (for instance, how certain variables are controlled for) until the p-value drops below the threshold. The other problem is that the p-value is sensitive to the sample size: if you want a lower p-value, increase the sample size. In many cases, p-values aren't a good way to show whether a result is useful or not.

    I'm a meteorologist and I research severe thunderstorms. Let's say I want to test whether a particular variable is useful in discriminating between tornadic and non-tornadic supercells. One approach might be to calculate the mean of that variable for tornadic supercells and the mean for non-tornadic supercells. The null hypothesis is that the means of the two samples are the same, and I calculate a p-value. If the sample size is large enough, that is, if I've included enough supercells, I can make even very small differences in the means appear statistically significant.

    A better approach is to use that variable as a predictor and have two data sets -- a training data set and a testing data set. I fit a function on the training data set that classifies storms, using the variable as a predictor of whether a storm will be tornadic or not. Then I evaluate it on the testing data set, and the metric of success is its accuracy (hits, misses, and false alarms) in predicting whether a storm will be tornadic. This is better because simply increasing the sample size isn't going to manufacture predictive skill that isn't there.

    Normally, some kind of baseline is chosen, and you want to show that your method performs better than the baseline. Of course, the problem is that you have a lot of flexibility in how to choose this baseline, and reviewers still need to be careful in how they evaluate the work. For example, let's say I cite a paper saying that, climatologically, 20% of supercells are tornadic. I could randomly guess whether a supercell is tornadic based on that 20% probability and use that as my baseline. If my work is useful, I should outperform random guessing based on climatology.

    This isn't the best way, though, because we know of several variables that are useful in predicting whether supercells will be tornadic. A better baseline would include the variables already known to be useful and then test whether the additional variable adds skill. It also helps to have some physical explanation for why a particular variable would affect whether a supercell is tornadic.

    There are cases where p-values are useful, but it's also very easy to abuse them. There's no substitute for vigilant reviewers who can spot misuses of statistics. There's nothing magical about a p-value of 0.05 or 0.005. I have no problem with p-values being presented, but I think a better approach would be to require that papers include more than p-values to demonstrate that a result is significant. I've described one such approach above that I use in my own research.
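    A minimal sketch of the train/test evaluation described above, using scikit-learn. The data, the 20% base rate, and the predictors are entirely synthetic stand-ins (real storm environments would replace the random arrays), so this only shows the shape of the comparison, not any real result:

    ```python
    # Hypothetical sketch of the train/test approach described above.
    # The two "known" predictors stand in for things like CAPE and shear;
    # "candidate" is the new variable whose added skill we want to test.
    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(42)
    n = 2000
    X_known = rng.normal(size=(n, 2))                  # predictors already in use
    candidate = rng.normal(size=(n, 1))                # new variable under test
    logit = 0.8 * X_known[:, 0] + 0.6 * X_known[:, 1] + 1.0 * candidate[:, 0] - 1.5
    y = rng.uniform(size=n) < 1 / (1 + np.exp(-logit))  # roughly 20% "tornadic"

    X_base = X_known
    X_full = np.hstack([X_known, candidate])

    Xb_tr, Xb_te, Xf_tr, Xf_te, y_tr, y_te = train_test_split(
        X_base, X_full, y, test_size=0.5, random_state=0)

    base = LogisticRegression().fit(Xb_tr, y_tr)
    full = LogisticRegression().fit(Xf_tr, y_tr)

    print("climatology baseline accuracy:", max(y_te.mean(), 1 - y_te.mean()))
    print("known predictors only:        ", base.score(Xb_te, y_te))
    print("plus candidate variable:      ", full.score(Xf_te, y_te))
    # The candidate "adds skill" only if the last number beats the second on
    # held-out data; hits, misses, and false alarms could be tabulated the same way.
    ```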

    • Personally, I prefer the "successful predictions" measure of validity.

    • Wonder of wonders.
      A truly scientific post on slashdot.
    • I couldn't agree more. The p-value is a very specific thing, and by itself it can't really say whether a result is valid or not. I've done my share of analysis for folks, and seen the results of others' analyses, and in many cases you can statistically "prove" just about anything you want to prove; often there is an agenda more akin to "we're trying to prove X", where X is the desired outcome.

      It is all about the data and the methodology, more than the statistical math. Cherry p

    • I can make even very small differences in the means appear statistically significant.

      Small differences in the mean can be statistically significant. And yes you need large sample sizes to show this. Not sure what you mean by appear...

      A better approach is to use that variable as a predictor and have two data sets -- a training data set and a testing data set. I then calculate a function to classify storms based on the training data set, using the variable as a predictor of whether a storm will be tornad

  • This will mean that big pharma will have to run an order of magnitude more studies until they can find the one study which can be published because it shows a positive correlation.

    [yes, I know statistics don't really work that way]

    • by crunchygranola ( 1954152 ) on Wednesday July 26, 2017 @09:32PM (#54888355)

      This will mean that big pharma will have to run an order of magnitude more studies until they can find the one study which can be published because it shows a positive correlation.

      [yes, I know statistics don't really work that way]

      Actually they kind of do!

      A tactic that pharma companies have pulled many times in the past is to try to keep generic drugs off the market by showing that they are not equivalent to the proprietary product. They do this by running a couple dozen animal studies, with the animals being given the two different products and various physiological parameters being monitored. When one of these parameters is found to differ between the two drugs at p < 0.05, they submit the result to the FDA declaring that the two drugs are not equivalent in their effects (the parameter, of course, has nothing to do with the drug's actual pharmacological effect).

      Now, with this standard, they will have to run 200 or so tests to find one that reaches p < 0.005.
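      A quick back-of-the-envelope check (my own illustration, assuming independent tests on truly equivalent drugs) of how many tests it takes to have a good chance of at least one spurious "difference":

      ```python
      # Hypothetical sketch: chance of at least one false positive among N independent
      # tests of parameters that in truth do not differ: P = 1 - (1 - alpha)**N.
      def chance_of_false_positive(alpha: float, n_tests: int) -> float:
          return 1.0 - (1.0 - alpha) ** n_tests

      for alpha in (0.05, 0.005):
          for n_tests in (24, 200):
              p = chance_of_false_positive(alpha, n_tests)
              print(f"alpha={alpha}, {n_tests} tests: "
                    f"{100 * p:.0f}% chance of a spurious 'difference'")
      # Two dozen tests at alpha = 0.05 already give ~71% odds of one false hit;
      # at alpha = 0.005 it takes on the order of a couple hundred tests for similar odds.
      ```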

  • Fisher's comment (Score:5, Informative)

    by Anonymous Coward on Wednesday July 26, 2017 @09:52PM (#54888435)

    If you want reproducibility, well then, require reproducibility.
    Fisher, the inventor of p-values, said this about them:
    >>"[We] thereby admit that no isolated experiment, however significant in itself, can suffice for the experimental demonstration of any natural phenomenon. In order to assert that a natural phenomenon is experimentally demonstrable we need, not an isolated record, but a reliable method of procedure. In relation to the test of significance, we may say that a phenomenon is experimentally demonstrable when we know how to conduct an experiment which will rarely fail to give us a statistically significant result" (Fisher, 1960, p. 13-14).

  • How many is a "mega-team"? A million, or one million, forty-eight thousand, five hundred and seventy-six?

    It appears to be seventy-two.

    • by PPH ( 736903 )

      It appears to be seventy-two.

      Is this a statistically valid sample size for p < 0.005? Have they repeated the study with different study teams to see if the results are repeatable?

  • by SlaveToTheGrind ( 546262 ) on Wednesday July 26, 2017 @11:28PM (#54888743)

    This would just provide a new target for the p-hackers [wikipedia.org].

  • "It seemed like this was something that was doable and easy, and had worked in other fields."

    So where are the studies that "prove" this? Oh, and they'd better have a significance of 0.005 or better.

  • by rew ( 6140 ) <r.e.wolff@BitWizard.nl> on Thursday July 27, 2017 @02:28AM (#54889275) Homepage

    The problem with current research in semi-soft sciences like biology and medicine is that scientists use the p-value incorrectly.

    If you suspect a glass of wine a day lowers the chance of heart disease, take 1000 volunteers, roll a die, and tell half of them not to have that wine a day and the other half to please drink one glass of wine a day. Then you wait two years and compare the incidence of heart problems in the two groups. That's where a 0.05 p-value is acceptable. (In practice, telling people to suddenly stop or start drinking is not going to go well.)

    Things become problematic when you suspect "something we can measure may be related to this disease" (e.g. sarcoidosis), take 200 patients and 200 healthy people, and then measure 200 parameters in each of the 400 blood samples... Even if there is nothing real to find, you can expect about 1/20th of the parameters, roughly 10, to appear (p = 0.05) different between the two groups.

    In the case at hand, one or two measurable parameters ARE different in the patient group, so you'll have a better than 95% chance of finding those. Of the 198 other parameters, about 1/20th will show up as false positives, for a total of almost 12 publishable results.

    Should you want to increase your chances of finding these publishable results, the sample size needs to be relatively small; the group of 200 patients and 200 healthy people might already be too big to get enough spurious results. Even if they don't do this consciously, scientists will quickly learn to optimize their sample size to find publishable results.

    When I was a freshman in 1985, some guy asked me to help him put his research into the computer. He had formulated 50 or so questions and predicted that boys would answer differently than girls. So he went into a classroom, interviewed 30 boys and girls, and put his results into the computer. Of course the computer told him there were several significant differences between boys and girls. Some of them were real (do you like to play with trains? dolls?), some of them were not (I don't remember the example).

    The other example is more recent: a Dutch doctor got her PhD with (among other things) the sarcoidosis research described above. My run-ins with the subject are very limited simply because I don't move in those circles, so this is far more widespread than just the few examples I encounter personally.

    Then people try to "fix" this by proposing the wrong solutions.

    The research: "can we find a parameter that allows us to differentiate between the two groups" is very important as well. But you have to do your research in the right way. Take 100 patients and 100 healthy people and find the parameters that seem to make a difference. NOW you go into the second half of the research with a hypothesis: "this parameter is important" and verify your claim. Now the p=0.05 is acceptable. (a 5% chance that you're wrong, as opposed to a 95% chance your'e full of shit).

  • by Master Of Ninja ( 521917 ) on Thursday July 27, 2017 @03:32AM (#54889493)
    Having seen it first hand: a lot of people go through "degree factories", getting degrees with only the basics of statistical knowledge, and a little knowledge is very dangerous. The p-value is a useful measure, but in biomedical circles it has been simplified to "p < 0.05 = good". If you read the other upvoted threads, or some of the linked articles, you'll understand why this is a big problem.

    There are a few tensions here that I think are causing this: (a) publish or perish: if it looks reasonable enough, publish, because that's where your next job comes from; (b) poor statistical training, on both the authors' and the reviewers' side; (c) unwillingness to fund or publish work that reproduces previous results, i.e. a publisher-created publication bias; (d) the generally high cost of patient-centred biomedical research, which means you usually have low sample numbers; and (e) the unwillingness in some disciplines to get formal statistical input.

    What are the potential solutions? If there were an unrestricted money pool you could recruit adequately (n > 10,000) for each study, but the money is not there, and there are some very rare diseases around. Better statistical training would be ideal, and there has been a push towards Bayesian analysis, though as with most statistical tools someone will eventually find a way to use it inappropriately. Self-publishing could be an option: I've seen some horrifically bad peer-reviewed articles (and predatory journals! [wikipedia.org]), but there is an ethical tension between publishing without review, which could flood the literature with garbage that is difficult to sort through, and actual proper peer review. Maybe something like Arxiv [arxiv.org] for biomedical science, although I suspect there would be a lot of resistance to it.

    I don't hold out much hope for a quick solution, because there are a lot of vested interests and people keep reaching for whatever newfangled statistical methods they've just learned. I recently reviewed a paper, with multiple authors from a big university, where I just shook my head at the amount of statistical fudging that took place: the authors had imputed [wikipedia.org] about 80% of their primary predictor variable for an outcome, and then drew a conclusion based on the imputed data. I was amazed that this is actually allowed nowadays. While this article is good, some of its authors have been banging on about this for some time without much change.
  • by epine ( 68316 ) on Thursday July 27, 2017 @07:30AM (#54890161)

    Most of the comments I skimmed are missing the point.

    The real problem is that even scientists with the best training and the best intentions wind up committing a certain amount of p-hacking subconsciously. Just a simple data exploration to decide post hoc whether any collected data is corrupted or implausible, and you've already slithered one toe across the p-hacking line.

    When p-values gate publication, and publication gates promotion, you create a severe moral hazard where many of the scientists you end up promoting lie on the bottom half of the curve in self-policing their accidental p-hacking. The guy with the penchant to do slightly more irregular experiments, which require slightly more data cleanup, seems to get slightly more published results. Ba da boom.

    p=0.005 would put a pretty big crimp in this effect.

    Of course it doesn't solve the larger problem. But good golly, first things first.

    We also know from replication efforts that p=0.05 is allowing far too much crap to float over the gate. p=0.005 probably gets us closer to the crap level we naively assumed we'd get when we originally rallied around p=0.05.

    Probably the increased use of computers hasn't helped matters: even accidental p-hacking with pencil and paper is hard work.
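    A minimal simulation (my own sketch, purely illustrative) of the "accidental p-hacking" effect described above: when a null result prompts one extra round of "data cleanup", the realized false positive rate creeps above the nominal 5%.

    ```python
    # Hypothetical sketch: two groups with NO true difference. If a non-significant
    # test prompts us to drop the single most extreme observation and re-test,
    # the realized false positive rate exceeds the nominal alpha = 0.05.
    import numpy as np
    from scipy.stats import ttest_ind

    rng = np.random.default_rng(7)
    alpha, n, n_sims = 0.05, 30, 20000
    strict_hits = cleaned_hits = 0

    for _ in range(n_sims):
        a, b = rng.normal(size=n), rng.normal(size=n)
        if ttest_ind(a, b).pvalue < alpha:
            strict_hits += 1
            cleaned_hits += 1
        else:
            # "just cleaning up an implausible point" and trying again
            a2 = np.delete(a, np.argmax(np.abs(a - a.mean())))
            if ttest_ind(a2, b).pvalue < alpha:
                cleaned_hits += 1

    print(f"strict protocol:    {strict_hits / n_sims:.3f} false positive rate")
    print(f"with one 'cleanup': {cleaned_hits / n_sims:.3f} false positive rate")
    # The second number lands above 0.05 even though nothing real is there.
    ```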

  • There is a 95 percent chance this paper is significant. Oh wait, I meant to say 99.5 percent. I'll get the p-value right eventually.
  • The exact probability that a field's (e.g., a journal's) article is true can be found in John Ioannidis's PLoS Medicine article "Why Most Published Research Findings Are False". He lets R be the ratio of true relationships to no relationships among those tested in a field. It's equivalent to a background probability (a prior, though perhaps unknown). The positive predictive value (PPV), a probability, is PPV = (1 - beta) * R / (R - beta * R + alpha). A coarse bound for this is, when alpha = 0.05,
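    A small sketch (my own, just plugging illustrative numbers into the formula quoted above) of how the PPV behaves for a few values of R and the two thresholds under discussion:

    ```python
    # Hypothetical sketch: Ioannidis's positive predictive value,
    #   PPV = (1 - beta) * R / (R - beta * R + alpha),
    # where R is the ratio of true to null relationships being tested,
    # beta is the type II error rate (power = 1 - beta), and alpha the threshold.
    def ppv(R: float, power: float, alpha: float) -> float:
        beta = 1.0 - power
        return (1.0 - beta) * R / (R - beta * R + alpha)

    for alpha in (0.05, 0.005):
        for R in (1.0, 0.1):
            print(f"alpha={alpha}, R={R}, power=0.5: PPV = {ppv(R, 0.5, alpha):.2f}")
    # With R = 0.1 (one true relationship per ten nulls tested) and 50% power,
    # PPV is ~0.5 at alpha = 0.05 but ~0.9 at alpha = 0.005.
    ```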
  • ... is not in changing the epsilon value of P

    The real answer is in *requiring* 2 things:

    • - that significance criteria be fixed BEFORE the experiment is run
    • - that researchers be required to use NON-parametric statistical measures, as Fisher originally intended
  • If you only allow publication of effects with p < 0.005, that means that in order to prevent the publication of one false positive you are discarding ~190 results that had a true difference in outcome. I'd agree that it is better to publish nothing than to publish a wrong result, but this level of certainty seems excessive to me. Maybe 0.05 is a little too high, but surely at 0.02 or 0.01 (a one-in-a-hundred chance that you are wrong) it is time to move on to the next experiment, not keep doing
