## Weak Statistical Standards Implicated In Scientific Irreproducibility

ananyo writes "The plague of non-reproducibility in science may be mostly due to scientists' use of weak statistical tests, as shown by an innovative method developed by statistician Valen Johnson, at Texas A&M University. Johnson found that a P value of 0.05 or less — commonly considered evidence in support of a hypothesis in many fields including social science — still meant that as many as 17–25% of such findings are probably false (PDF). He advocates for scientists to use more stringent P values of 0.005 or less to support their findings, and thinks that the use of the 0.05 standard might account for most of the problem of non-reproducibility in science — even more than other issues, such as biases and scientific misconduct."

• #### Five Sigma or Bust (Score:2)

Five sigma is the standard of proof in Physics. The probability of a background fluctuation is a p-value of something like 0.0000006.
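
For anyone who wants to check that figure, here is a minimal sketch that converts a sigma level into a tail probability. It assumes a standard normal background; the helper name `sigma_to_p` is made up for illustration. The quoted 0.0000006 matches the two-sided tail, while the one-sided tail is about half of that.

```python
import math

def sigma_to_p(n_sigma, two_sided=True):
    """Probability that a standard normal variable lands beyond n_sigma.

    Assumption: the background fluctuation is Gaussian, which is an
    idealisation and not something the thread itself states.
    """
    one_tail = 0.5 * math.erfc(n_sigma / math.sqrt(2))
    return 2 * one_tail if two_sided else one_tail

for n in (1.96, 3, 5):
    print(f"{n} sigma: two-sided p = {sigma_to_p(n):.3g}, "
          f"one-sided p = {sigma_to_p(n, two_sided=False):.3g}")
```

Under the same assumption, the usual 0.05 threshold corresponds to roughly 1.96 sigma (two-sided).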
• #### Re:Five Sigma or Bust (Score:4, Interesting)

on Tuesday November 12, 2013 @08:02PM (#45407289)

Five sigma is the standard of proof in Physics. The probability of a background fluctuation is a p-value of something like 0.0000006.

Of proof yes... that makes sense.

Other fields should probably use a threshold of 0.005 or 0.001.

If they move to five sigma... 2013 might be the last year that scientists get to keep their jobs.

What are you supposed to do if no research in any field is admissible, because the bar is so high no one can meet it, even with meaningful research?

• #### Re:Five Sigma or Bust (Score:4, Insightful)

<wwoodhull@gmail.com> on Tuesday November 12, 2013 @08:51PM (#45407703) Homepage Journal

Agreed. P = 0.05 was good enough in my high school days, when handheld calculators were the best available tool in most situations, and you had to carry a couple of spare nine volt batteries for the thing if you expected to keep it running through an afternoon lab period.

We have computers, sensors, and methods for handling large data sets that were impossible to do anything with back in the day before those first woodburning "minicomputers" of the 1970s. It is ridiculous that we have not tightened up our criteria for acceptance since those days.

Hell, when I think about it, using P = 0.05 goes back to my Dad's time, when he was using a slide rule while designing engine parts for the SR-71 Blackbird. That was back in the 1950s and '60s. We should have come a long way since then. But have we?

• #### Re: (Score:2)

Hell, when I think about it, using P = 0.05 goes back to my Dad's time, when he was using a slide rule while designing engine parts for the SR-71 Blackbird. That was back in the 1950s and '60s. We should have come a long way since then. But have we?

In engineering? Yes [rt.com]. Science? Well...

• #### Re: (Score:2, Insightful)

by Anonymous Coward

Agreed. P = 0.05 was good enough in my high school days, when handheld calculators were the best available tool in most situations

Um, the issue is not that it is difficult to calculate P-values less than 0.05. Obtaining a low p-value requires either a better signal to noise ratio in the effect you're attempting to observe, or more data. Improving the signal to noise ratio is done by improving experimental design, removing sources of measurement error like rater reliability, measurement noise, covariates, etc. It should be done to the extent feasible, but you can't wave a magic wand and say "computers" to fix it. Likewise, data collect

• #### Re:Five Sigma or Bust (Score:4, Insightful)

on Wednesday November 13, 2013 @12:31AM (#45409273)

We have computers, sensors, and methods for handling large data sets that were impossible to do anything with back in the day before those first woodburning "minicomputers" of the 1970s. It is ridiculous that we have not tightened up our criteria for acceptance since those days.

But that stuff isn't the limiting factor. The limiting factor is usually getting enough high quality data. In certain fields that's very hard because measurements are hard or expensive to make and the signal to noise is poor. So you do the best you can. This is why criteria aren't tighter now than before: because stuff at the cutting edge is often hard to do.

• #### Re: (Score:2)

As others have said, the problem in many cases is not computational power, but the expense, difficulty, or even ethics of getting a large data set. What about a medical trial -- do you want to necessitate giving some experimental medicine to 10,000 people before assessing whether it's a good idea or not?

• #### Re: (Score:2)

do you want to necessitate giving some experimental medicine to 10,000 people before assessing whether it's a good idea or not?

We have 40,000 patent attorneys in the US alone, so there's one great sample for experimental medicine. Not very heterogeneous, but will do in a pinch.

• #### Re: (Score:3)

do you want to necessitate giving some experimental medicine to 10,000 people before assessing whether it's a good idea or not?

If the drug is going to be prescribed to millions of people a year: yes, probably. If not during a phase III trial, then during a phase IV trial that begins as soon as the drug goes on the market. The reason is that while efficacy can be extrapolated from smaller trials, safety is all about the outliers. A few excess deaths in a trial of several thousand could easily mean that the drug causes more harm than good overall, or that an identifiable patient subgroup can't tolerate the drug.

• #### Well yes, actually (Score:2)

> do you want to necessitate giving some experimental medicine to 10,000 people before assessing whether it's a good idea or not?

Yes. Before giving it to a million people, we should run statistical calculations on the first 10,000 to better assess safety and efficacy.

Oh, you meant as opposed to a trial with 200 people. But that's a false dichotomy. You run stats on the first 200 to see whether or not it's likely safe, then run stats on 10,000 to confirm it. Which is to say, you'd wait until you managed

• #### Re: (Score:2)

"In the meantime, with a P of 0.05, you'd label it as a tentative conclusion, a likely theory."

Great, so we agree: Getting a P of less than 0.05 with a sample of 200 gets you published and other machinery acting on that signal.

• #### Re: (Score:2)

"What are you supposed to do if no research in any field is admissible, because the bar is so high no one can meet it, even with meaningful research?"

James Cameron could reach the bar.

• #### Re: (Score:2)

James Cameron could reach the bar.

Hm... James Cameron is a deep-sea explorer, and film director.... he directed Titanic.

In what way does that make him a researcher who could be sure of meeting five sigma in all his research, even when infeasibly massive datasets would be required?

• #### Re: (Score:2)

Why should it be generalized to whole fields rather than based on what you are studying?

If I'm publishing that drug X does not increase the incidence of spontaneous human combustion, there ought to be a lot of zeroes in that P value. If I'm publishing that "As expected, Protein X does job Y in endangered species Z, which is not surprising given that protein X does job Y in every other species tested, and why the hell did we even do this experiment" then maybe you don't need such a high standard.
• #### Scarcely productive (Score:5, Interesting)

on Tuesday November 12, 2013 @07:59PM (#45407255)

Such an admonishment is fine for the computational fields, where a few more permutations can net you a p-value of 0.0005 (assuming that you aren't crunching on a 4-month cluster problem). However, biological lab experiments are often very expensive and take a lot of time. Furthermore, additional tests are not always possible, since it can be damn hard to reproduce specific mutations or knockout sequences without altering the surrounding interactive factors.

So, should we go for a better p-value for the experiment and scrap any complicated endeavour, or should we allow for difficult experiments and take it with a grain of salt?

• #### Economic Impact (Score:3, Insightful)

by Anonymous Coward on Tuesday November 12, 2013 @08:00PM (#45407269)

Truth is expensive.

• #### Re: (Score:2)

but producing shitty studies pays just as well as producing good studies, especially if the study is about something of no consequence at all to anyon..

• #### Not going to happen (Score:4, Insightful)

by Anonymous Coward on Tuesday November 12, 2013 @08:02PM (#45407281)

If we were to insist on statistically meaningful results 90% of our contemporary journals would cease to exist for lack of submissions.

• #### Re:Not going to happen (Score:4, Insightful)

on Tuesday November 12, 2013 @08:42PM (#45407623)

...and nothing of value would be lost. Seriously, have you read the papers coming from that 90% of journals and conference proceedings outside of the big ones in \$field_of_study? The vast majority of them suck, have extraordinarily low standards, and are oftentimes barely readable. There's a reason why the major conferences/journals that researchers actually pay attention to routinely turn away between 80-95% of papers being submitted: it's because the vast majority of research papers are unreadable crap with marginal research value being put out to bolster someone's published paper count so that they can graduate/get a grant/attain tenure.

If the lesser 90% of journals/conferences disappeared, I'd be happy, since it'd mean wading through less cruft to find the diamonds. I still remember doing weekly seminars with my research group in grad school, where we'd get together and have one person each week present a contemporary paper. Every time one of us tried to branch out and use a paper from a lesser-known conference (this was in CS, where the conferences tend to be more important than the journals), we ended up regretting it, since they were either full of obvious holes, incomplete (I once read a published paper that had empty Data and Results sections...just, nothing at all, yet it was published anyway), or relied on lots of hand-waving to accomplish their claimed results. You want research that's worth reading, you stick to the well-regarded conferences/journals in your field, otherwise the vast majority of your time will be wasted.

• #### Interpretation of the 0.05 threshold (Score:5, Insightful)

on Tuesday November 12, 2013 @08:06PM (#45407325) Journal

Personally, I've considered results with p values between 0.01 and 0.05 as merely 'suggestive': "It may be worth looking into this more closely to find out if this effect is real." Between 0.01 and 0.001 I'd take the result as tentatively true - I'll accept it until someone refutes it.

If you take p=0.04 as demonstrating a result is true, you're being foolish and statistically naive. However, unless you're a compulsive citation follower (which I'm not) you are somewhat at the mercy of other authors. If Alice says "In Bob (1998) it was shown that ..." I'll tend to accept it without realizing that Bob (1998) was a p=0.04 result.

Obligatory XKCD [xkcd.com]

• #### Re: (Score:2)

Obligatory XKCD [xkcd.com]

FWIW, tests like the Tukey HSD ("Honestly Significant Difference") are designed to avoid that problem.

I suspect that's how the much-discussed "Jupiter Effect" for astrology came about: Throw in a big pile of names and birth signs, turn the crank, and watch a bogus correlation pop out.
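
A rough sketch of why "turn the crank" on many comparisons produces bogus hits. It assumes 20 independent tests at the 0.05 level with no real effect anywhere (the count is an assumption chosen for illustration), and it uses the simple Bonferroni correction as a stand-in, not Tukey's HSD itself. Both function names are hypothetical.

```python
def familywise_error_rate(alpha, n_tests):
    """Chance of at least one false positive across n independent null tests."""
    return 1 - (1 - alpha) ** n_tests

def bonferroni_alpha(alpha, n_tests):
    """Per-test threshold that keeps the familywise rate at or below alpha."""
    return alpha / n_tests

n_tests = 20  # assumed number of comparisons
print("uncorrected:", familywise_error_rate(0.05, n_tests))
corrected = bonferroni_alpha(0.05, n_tests)
print("per-test threshold after Bonferroni:", corrected)
print("corrected familywise rate:", familywise_error_rate(corrected, n_tests))
```

With those assumptions the uncorrected chance of at least one spurious "significant" result is well over one half.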

• #### Re: (Score:2)

p values in this context don't tell you if something is true. They tell you that the data is unlikely to be from the *null* model. It's not the same as support for the alternative.
• #### Re: (Score:2)

A p-value of 0.05 means that 1 in 20 results are false positives. This implies that 5% of all scientific papers with a p-value of 0.05 are false. However, applying some statistics, it might even be worse than that [economist.com]. Textual summary of that link:

Say 1000 hypotheses are tested, and that 10% are true - that is, 100 true hypotheses, and 900 false ones. If the false positive rate is 5%, then 45 of the 900 false hypotheses will still test as true. Further, let's say there's a false negative rate of 10%. So of the 100 true hypotheses, 1
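
The arithmetic in that summary can be sketched as follows, using only the numbers given in the comment (1000 hypotheses, 10% true, 5% false positive rate, 10% false negative rate). The comment is cut off, so the final ratio below is simply what those inputs imply; the function name is made up.

```python
def false_share_of_positives(n_hypotheses, prior_true, alpha, beta):
    """Fraction of 'significant' findings that are actually false.

    alpha: false positive rate, beta: false negative rate.
    """
    n_true = n_hypotheses * prior_true
    n_false = n_hypotheses - n_true
    false_positives = n_false * alpha       # false hypotheses that pass the test
    true_positives = n_true * (1 - beta)    # true hypotheses that are detected
    return false_positives / (false_positives + true_positives)

share = false_share_of_positives(1000, prior_true=0.10, alpha=0.05, beta=0.10)
print(f"share of positive results that are false: {share:.3f}")
```

The share rises if the prior fraction of true hypotheses is lower or the false negative rate is higher.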
• #### Obligatory XKCD (Score:2, Funny)

by Anonymous Coward

http://xkcd.com/882/

• #### A universal standard for significance... (Score:3, Insightful)

by Anonymous Coward on Tuesday November 12, 2013 @08:11PM (#45407369)

Authors need to read this: http://www.deirdremccloskey.com/articles/stats/preface_ziliak.php
It explains quite clearly why a p value of 0.05 is a fairly arbitrary choice, as it cannot possibly be the standard for every possible study out there. Or, to put it another way, be very skeptical when one sole number (namely 0.05) is supposed to be a universal threshold to decide on the significance of all possible findings, in all possible domains of science. The context of any finding still matters for its significance.

• #### The Economist just had an article on this (Score:3)

on Tuesday November 12, 2013 @08:25PM (#45407491)

Unreliable research
Trouble at the lab
Scientists like to think of science as self-correcting. To an alarming degree, it is not
Oct 19th 2013 |From the print edition
The Economist

First, the statistics, which if perhaps off-putting are quite crucial. Scientists divide errors into two classes. A type I error is the mistake of thinking something is true when it is not (also known as a “false positive”). A type II error is thinking something is not true when in fact it is (a “false negative”). When testing a specific hypothesis, scientists run statistical checks to work out how likely it would be for data which seem to support the idea to have come about simply by chance. If the likelihood of such a false-positive conclusion is less than 5%, they deem the evidence that the hypothesis is true “statistically significant”. They are thus accepting that one result in 20 will be falsely positive—but one in 20 seems a satisfactorily low rate.

In 2005 John Ioannidis, an epidemiologist from Stanford University, caused a stir with a paper showing why, as a matter of statistical logic, the idea that only one such paper in 20 gives a false-positive result was hugely optimistic. Instead, he argued, “most published research findings are probably false.” As he told the quadrennial International Congress on Peer Review and Biomedical Publication, held this September in Chicago, the problem has not gone away.

Dr Ioannidis draws his stark conclusion on the basis that the customary approach to statistical significance ignores three things: the “statistical power” of the study (a measure of its ability to avoid type II errors, false negatives in which a real signal is missed in the noise); the unlikeliness of the hypothesis being tested; and the pervasive bias favouring the publication of claims to have found something new.

• #### What is a p Value? (Score:2)

A significant problem is that many of the people who quote p values do it without understanding what a p value actually means. Getting p = 0.05 does not mean that there is only a 5% chance that the model is wrong. That is one of the fundamental misunderstandings in statistics, and I suspect that it is behind a lot of the cases of scientific irreproducibility.

• #### The real issue (Score:5, Interesting)

on Tuesday November 12, 2013 @09:06PM (#45407845) Homepage Journal

Okay, here's the real problem with scientific studies.

All science is data compression, and all studies are intended to compress data so that we can make future predictions. If you want to predict the trajectory of a cannonball, you don't need an almanac cross referencing cannonball weights, powder loads, and cannon angles - you can calculate the arc to any desired accuracy with a set of equations that fit on half a page. The half-page compresses the record of all prior experience with cannonball arcs, and allows us to predict future arcs.

Soft science studies typically make a set of observations which relate two measurable aspects. When plotted, the data points suggest a line or curve, and we accept the linear-regression (line or polynomial) as the best approximation for the data. The theory being that the underlying mechanism is the regression, and unrelated noise in the environment or measurement system causes random deviations of observation.

This is the wrong method. Regression is based on minimizing squared error, which was chosen by Laplace for no other reason than that it is easy to calculate. There are lots of "rationalization" explanations of why it works and why it's "just the best possible thing to do", but there's no fundamental logic that can be used to deduce least squares from fundamental assumptions.

Least squares introduces several problems:

1) Outliers will skew the values, and there is no computable way to detect or deal with outliers (source [wikipedia.org]).

2) There is no computable way to determine whether the data represent a line or a curve - it's done by "eye" and justified with statistical tests.

3) The resultant function frequently looks "off" to the human eye, humans can frequently draw better matching curves; meaning: curves which better predict future data points.

4) There is no way to measure the predictive value of the results. Linear regression will always return the best line to fit the data, even when the data is random.

The right way is to show how much the observation data is compressed. If the regression function plus data (represented as offsets from the function) take fewer bits than the data alone, then you can say that the conclusions are valid. Further, you can tell how relevant the conclusions are, and rank and sort different conclusions (linear, curved) by their compression factor and choose the best one.

Scientific studies should have a threshold of "compresses data by N bits", rather than "1-in-20 of all studies are due to random chance".

• #### Re: (Score:2)

1) Outliers will skew the values, and there is no computable way to detect or deal with outliers (source [wikipedia.org])

Do outliers skew the results? If the outliers are biased, then that may tell us something about the underlying population. If they aren't biased, then their effects cancel.

4) There is no way to measure the predictive value of the results. Linear regression will always return the best line to fit the data, even when the data is random.

But random data would generate statistically insignific

• #### Re: (Score:2)

Do outliers skew the results? If the outliers are biased, then that may tell us something about the underlying population. If they aren't biased, then their effects cancel.

There's no algorithm that will identify the outliers in this example [dropbox.com].

But random data would generate statistically insignificant correlation coefficients. Also, the 95% confidence intervals used to predict values are wider for random data.

What value of correlation coefficient distinguishes pattern data from random data in this image [wikimedia.org]?

• #### Re: (Score:2)

There's no algorithm that will identify the outliers in this example [dropbox.com].

So there's no algorithm for comparing observed values to modeled (predicted) values? The absolute value of the difference between the two can't be calculated? Hmm. . .

What value of correlation coefficient distinguishes pattern data from random data in this image [wikimedia.org]?

Are the data in that image random? Also, the data without the four points at the bottom would have a higher correlation coefficient.

• #### Re: (Score:2)

Also, you may want to account for the difference between the x coordinate of the point and the average of the xs, as having an x coordinate far from the mean contributes to being farther away from the regression line.

• #### Re: (Score:2)

Oops, I missed the second image. But the correlation coefficients are there. The sets of data that more closely approximate a line have such values close to 1 or -1. The ones that don't have values close to 0.

• #### Re: (Score:2)

"If they aren't biased [outliers], then their effects cancel."

Oh, god no. The book I teach out of says that if outliers exist, it's required to do the regression both with and without the outliers and compare. Frequently there will be a big difference. (Weiss, Introductory Statistics, Sec. 14.2)
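
A minimal sketch of that with-and-without comparison, using made-up data: ten points that sit exactly on y = 2x + 1 plus one hypothetical outlier. The helper `fit_line` is an ordinary least-squares fit written out by hand and is not taken from the textbook.

```python
def fit_line(points):
    """Ordinary least-squares line through (x, y) pairs: returns (slope, intercept)."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Hypothetical data: a clean linear trend, then the same data with one outlier.
clean = [(x, 2.0 * x + 1.0) for x in range(10)]
with_outlier = clean + [(10, 0.0)]  # the trend would predict 21 here

print("without outlier:", fit_line(clean))
print("with outlier:   ", fit_line(with_outlier))
```

In this toy case a single point at the edge of the x range pulls the slope from 2 to roughly 1, which is the kind of "big difference" the comparison is meant to expose.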

• #### Re: (Score:2)

Actually, there is a really good reason to use least-squares regression. A model that minimizes squared error is guaranteed to minimize the variance of error, obviously. Now assume that in a model you have taken into account all variables that have real predictive value, and are fairly independent. Then your error should be normally distributed, and randomly over the range of your data by the Central Limit Theorem. So if your data looks like that after fitting the model, your model probably has very good re
• #### Variance of error is not what we want (Score:2)

Actually, there is a really good reason to use least-squares regression. A model that minimizes squared error is guaranteed to minimize the variance of error, obviously.

This is the wrong place for an argument (you want room 12-A [youtube.com]) and I don't want to get into a contest, but for illustration here is the problem with this explanation.

A rule learned from experience should minimize the error, not the variance of error.

It's a valid conclusion from the mathematics, but based on a faulty assumption.

• #### Re: (Score:2)

The problem I have with least squares is that I don't like the definition of the "error". If you have two things that are correlated, one isn't necessarily a function of the other that includes some variability. If you flip the X and Y axes over - plot height against weight, rather than weight against height - then the least squares regression gives a different line. If the two errors are both minimised, but different, then neither of them is the "real" error. One's the (squared) distance to the regression
• #### That's brilliant - thanks! (Score:2)

The problem I have with least squares is that I don't like the definition of the "error". If you have two things that are correlated, one isn't necessarily a function of the other that includes some variability. If you flip the X and Y axes over - plot height against weight, rather than weight against height - then the least squares regression gives a different line. If the two errors are both minimised, but different, then neither of them is the "real" error.

Wow - brilliant insight! Thanks for that - things like this are why I come to Slashdot.
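
The axis-flip effect is easy to see numerically. The sketch below uses invented height/weight pairs and a hand-written slope function (`ols_slope` is a hypothetical name). It fits weight on height, then height on weight, and expresses the second line in the first line's axes so the two slopes can be compared.

```python
def ols_slope(xs, ys):
    """Least-squares slope of ys regressed on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx

# Hypothetical, imperfectly correlated data (cm and kg).
heights = [150, 160, 165, 170, 175, 180, 185, 190]
weights = [52, 60, 58, 70, 68, 80, 79, 95]

b_weight_on_height = ols_slope(heights, weights)  # minimises vertical errors
b_height_on_weight = ols_slope(weights, heights)  # minimises horizontal errors

# Slope of the second line redrawn on the weight-vs-height plot.
flipped_slope = 1 / b_height_on_weight
r_squared = b_weight_on_height * b_height_on_weight

print("weight on height slope:", b_weight_on_height)
print("height on weight line, same axes:", flipped_slope)
print("r squared:", r_squared)
```

The product of the two fitted slopes equals r squared, so the two lines coincide only when the correlation is perfect.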

• #### Re: (Score:2)

I forgot to include in my post the fact that as I was reading the article and earliest posts I was fomenting in my head an idea that was based on information theory, and so was *most pleased* to see your compression example - the kind of thing that I come to slashdot for. That's, I think, the purest approach; whether it's practical or not is another matter, and - to a pure mathematician like me - irrelevant!

I replied to your post rather than the parent, where it was perhaps more relevant, as it was you I wis
• #### Hey - pure mathematician! (Score:2)

Can I discuss some ideas with you offline? thon dot 9 dot okianwarrior at spamgourmet dot com

• #### Re: (Score:2)

Continuous data is just one possible scientific problem. Most studies are done comparing 2 groups which differ by a categorical variable. There are other forms of regression like ordinal linear regression...
• #### An example (Score:3)

on Tuesday November 12, 2013 @09:59PM (#45408193) Journal

Having quickly skimmed the paper, I'll give an example of the problem.
I couldn't quickly find a real data set that was easy to interpret, so I'm going to make up some data.

Chance to die before reaching this age:

| Age | Woman | Man |
| --- | --- | --- |
| 80 | .54 | .65 |
| 85 | .74 | .83 |
| 90 | .88 | .96 |
| 95 | .94 | .98 |

We have a person who is 90 years old. Taking the null hypothesis to be that this person is a man, we can reject the hypothesis that this is a man with greater than 95 percent confidence (p=0.04). However, if we do a Bayesian analysis assuming prior probabilities of 50 percent for the person being a man or a woman, we find that there is a 25 percent chance that the person is a man after all (as women are 3 times more likely to reach age 90 than men are.)

(Having 11 percent signs in my post seems to have given /. indigestion so I've had to edit them out.)
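
The Bayesian step can be written out as a short sketch, using the made-up figures from the post (chance to die before 90 is .88 for women and .96 for men) and the assumed 50/50 prior. The function name `posterior` is hypothetical.

```python
def posterior(likelihood, prior):
    """Bayes' rule over a finite set of hypotheses."""
    evidence = sum(likelihood[h] * prior[h] for h in prior)
    return {h: likelihood[h] * prior[h] / evidence for h in prior}

# Probability of reaching age 90 = 1 - chance to die before 90 (made-up data).
reach_90 = {"man": 1 - 0.96, "woman": 1 - 0.88}
prior = {"man": 0.5, "woman": 0.5}  # assumed equal prior

result = posterior(reach_90, prior)
print(result)
```

So the same observation that "rejects" the man hypothesis at p = 0.04 still leaves a one-in-four chance that the person is a man, because the alternative is only three times more likely to produce it.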

• #### Well, duh. (Score:2)

Johnson found that a P value of 0.05 or less — commonly considered evidence in support of a hypothesis in many fields including social science — still meant that as many as 17–25% of such findings are probably false (PDF).

Found? Was he unaware that using a threshold of 0.05 means a 20% probability that a finding is a chance result - by definition?

More interesting, IMO, is that statistical significance doesn't tell you what the scale of an effect is. There can be a trivial difference between A and B even if the difference is statistically significant. People publish it anyway.

• #### Re: (Score:2)

There can be a trivial difference between A and B even if the difference is statistically significant. People publish it anyway.

This is especially prevalent in medicine (especially drug advertising). If you look at medical journals, they are replete with ads touting that Drug A is 'statistically better' than Drug B. Even looking at the 'best case' data (the pretty graph in the advert) you quickly see that the lines very nearly converge. Statistically significant. Clinically insignificant.

Lies, Damned Lies and Statistics

• #### Re: (Score:2)

Was he unaware that using a threshold of 0.05 means a 20% probability that a finding is a chance result - by definition ?

A P-value [wikipedia.org] of 0.05 means by definition that there is a 0.05, or 5%, or 1 in 20, probability that the result could be obtained by chance even though there's no actual relationship.
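
A small simulation sketch of that definition: when there is no actual relationship, about 1 experiment in 20 still comes out below 0.05. It assumes normally distributed data with known variance and a two-sided z-test; the sample size, trial count, and seed are arbitrary choices for illustration.

```python
import random
from statistics import NormalDist

def z_test_p_value(sample, sigma=1.0):
    """Two-sided p-value for 'true mean is zero', with known sigma."""
    n = len(sample)
    z = (sum(sample) / n) / (sigma / n ** 0.5)
    return 2 * (1 - NormalDist().cdf(abs(z)))

def null_false_positive_rate(n_experiments=5000, n=30, alpha=0.05, seed=0):
    """Share of pure-noise experiments that still reach p < alpha."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_experiments):
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]  # no real effect
        if z_test_p_value(sample) < alpha:
            hits += 1
    return hits / n_experiments

print("false positive rate under the null:", null_false_positive_rate())
```

Note that this is the rate among experiments where nothing is going on, which is not the same as the share of published positive findings that are false.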

• #### Re: (Score:2)

Johnson found that a P value of 0.05 or less — commonly considered evidence in support of a hypothesis in many fields including social science — still meant that as many as 17–25% of such findings are probably false (PDF).

Found? Was he unaware that using a threshold of 0.05 means a 20% probability that a finding is a chance result - by definition?

More interesting, IMO, is that statistical significance doesn't tell you what the scale of an effect is. There can be a trivial difference between A and B even if the difference is statistically significant. People publish it anyway.

Of course it was found. The 20% is not by definition, but a function of the percentage of studies done based on a correct or incorrect H1. You could have 0% if you only did studies on correct H1s.

• #### Re: (Score:2)

Wanted to reply to the parent, sorry.
• #### Re: (Score:2)

Found? Was he unaware that using a threshold of 0.05 means a 20% probability that a finding is a chance result - by definition ?

1 in 20 != 20%.

If p=0.2, then you'd be right.

• #### Integrating to x sigma (Score:2)

A surprising number of senior scientists are not aware of the problems introduced by ending an experiment based on achieving a certain significance level. By taking the significance as the criterion of the experiment, you don't actually know anything about the significance. Your highly significant result may just be a fluctuation because, had you continued, the high signal-to-noise ratio could well dissipate. Too often I've heard senior scientists advising junior scientists: You've got three sigma, publi
• #### Re: (Score:2)

A surprising number of senior scientists are not aware of the problems introduced by ending an experiment based on achieving a certain significance level.

I think the vast majority are aware of at least some of the problems. The issue is the ones who are willing to publish their results without addressing those problems honestly in the writeup.
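
To make the stopping problem concrete, here is a simulation sketch under assumed conditions: pure noise with known variance, a z-statistic recomputed after every new observation, and the experiment stopped the first time it crosses the two-sided 0.05 threshold. The limits (`min_n`, `max_n`), trial count, and seed are arbitrary illustration values.

```python
import random
from statistics import NormalDist

Z_CRIT = NormalDist().inv_cdf(0.975)  # two-sided 0.05 threshold, about 1.96

def stops_on_significance(rng, min_n=10, max_n=200):
    """Collect noise one point at a time; stop as soon as |z| exceeds Z_CRIT."""
    total = 0.0
    for n in range(1, max_n + 1):
        total += rng.gauss(0.0, 1.0)          # no real signal
        if n >= min_n and abs(total / n ** 0.5) > Z_CRIT:
            return True                       # "You've got your sigma, publish"
    return False

def peeking_false_positive_rate(trials=2000, seed=1):
    rng = random.Random(seed)
    return sum(stops_on_significance(rng) for _ in range(trials)) / trials

print("false positive rate with optional stopping:", peeking_false_positive_rate())
```

Under these assumptions the rate comes out far above the nominal 5 percent, because every extra look is another chance for a fluctuation to cross the line.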

• #### not the real problem (Score:4, Insightful)

on Tuesday November 12, 2013 @10:11PM (#45408285)

At one level, they are right that unreproducible results are usually not fraud, but are simply fluctuations that make a study look promising, leading to publication. But raising the standard of statistical significance will not really improve the situation. The most important uncertainties in most scientific studies are not random. You can't quantify them assuming a Gaussian distribution. There are all kinds of choices made in acquiring, processing, and presenting data. The incentives that scientists have are all pushing them to look for ways to obtain a high profile result. We make our best guesses trying to be honest, but when a set of guesses leads to a promising result we publish it and trust further study to determine whether our guesses were fully justified.

There is one step that would improve the situation. We need to provide a mechanism to receive career credit for reproducing earlier results or for disproving earlier results. At the moment, you get no credit for doing this. And you will never get funding to do it. The only way to be successful is to spit out a lot of papers and have some of them turn out to be major results that others build on.

The number of papers that turn out to be wrong is of no consequence. No one even notices except a couple of researchers who try to build on your result, fail, and don't publish. In their later papers they will probably carefully dance around the error so as not to incur the wrath of a reviewer. If reproducing earlier results was a priority, then we would know earlier which results were wrong and could start giving negative career credit to people who publish a lot of errors.
• #### The bigger problem (Score:2)

The bigger problem is the habit of confusing correlation with cause.

• #### Let's get something straight you non-statisticians (Score:4, Insightful)

on Tuesday November 12, 2013 @10:50PM (#45408553)

This is a geek website, not a "research" website so stop talking a bunch of crap about a bunch of crap. I'm providing silly examples so don't focus upon them. Most researchers suck at stats and my attempt at explaining should either help out or show that I don't know what I'm talking about. Take your pick.

"p=.05" is a stat that reflects the likelihood of rejecting a true null hypothesis. So, let's say that my hypothesis is that "all cats like dogs" and my null hypothesis is "not all cats like dogs." If I collect a whole bunch of imaginary data, run it through a program like SPSS, and the results turn out that my hypothesis is correct then I have a 5 percent (.05) chance that the software is wrong. In that particular imaginary case, I would have committed a Type I Error. This error has a minimal impact because the only bad thing that would happen is some dogs get clawed on the nose and a few cats get eaten.

Now, on a typical experiment, we also have to establish beta which is the likelihood of committing a type II error, that is, accepting a false null hypothesis. So let's say that my hypothesis is that "Sex when desired makes men happy" and my null hypothesis is "Sex only when women want it makes men happy." It's not a bad thing if #1 is accepted but the type II error will make many men unhappy.

Now, this is a give and take relationship. Every time that we make p smaller (.005, .0005, .00005, etc.) for "accuracy," then the risk of committing a type II error increases. A type II error when determining what games 15-year-olds like to play doesn't really matter if we are wrong but if we start talking about drugs and false negatives then the increased risk of a type II error really can make things ugly.

Next, there are guidelines for determining how many participants are needed for lower p (alpha) values. Social sciences (hold back your Sheldon jokes) that do studies on students might need, let's say, 35 subjects/people per treatment group at p=.05, whereas a .005 might need 200 or 300 per treatment group. I don't have a stats book in front of me but .0005 could be in the thousands. Every adjustment impacts a different item in a negative fashion. You can have your Death Star or you can have Luke Skywalker. Can't have 'em both.
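
One way to see how the required group size moves with alpha is the textbook normal-approximation formula for comparing two group means. The sketch below is an illustration under stated assumptions, not the guideline the post has in mind: a two-sided test, 80 percent power, and a medium standardized effect size of 0.5 are all assumed values, and the function name is made up.

```python
import math
from statistics import NormalDist

def n_per_group(alpha, power=0.80, effect_size=0.5):
    """Approximate subjects per group for a two-sample comparison of means.

    Normal approximation: n = 2 * ((z_{1-alpha/2} + z_{power}) / d) ** 2,
    where d is the standardized effect size (assumed, not from the thread).
    """
    z = NormalDist().inv_cdf
    z_alpha = z(1 - alpha / 2)
    z_power = z(power)
    return math.ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)

for alpha in (0.05, 0.005, 0.0005):
    print(f"alpha = {alpha}: about {n_per_group(alpha)} per group")
```

The exact numbers depend heavily on the assumed effect size and power; smaller effects push the required group sizes up much faster than a stricter alpha does.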

Finally, there is a statistical concept of power, that is, there are stats for measuring the impact of a treatment. Basically, how much of the variance between group A and group B can be assigned to the experimental treatment. This takes precedence in many people's minds over simply determining if we have a correct or incorrect hypothesis. Assigning p does not answer this.

Anyways, I'm going to go have another beer. Discard this article and move onto greener pastures.

• #### And has been so for, oh, 50 years? (Score:2)

"innovative methods"??? I do not know of a single serious scientist who hasn't been lectured on the ills of weak testing (and told not to use 0.05 as some sort of magical threshold below which everything magically works).

Back when I was a wee researchling, this [berkeley.edu] is literally one of the first papers I was told to read and internalise (published 20 years ago, and not even particularly breakthrough at the time).

There is absolutely no need for new evidence or further discussion of the limitations of statisti
• #### Nice method he's developed there... (Score:2)

... but is it reproducible? :p
