Math Technology

Finding a Needle in a Haystack of Data

Roland Piquepaille writes "Finding useful information in oceans of data is an increasingly complex problem in many scientific areas. This is why researchers from Case Western Reserve University (CWRU) have created new statistical techniques to isolate useful signals buried in large datasets coming from particle physics experiments, such as those run in a particle collider. But their method could also be applied to a broad range of problems, like discovering a new galaxy, monitoring transactions for fraud, or identifying the carrier of a virulent disease among millions of people." Case Western has also provided a link to the original paper. [PDF Warning]
  • Re:Indexes (Score:2, Informative)

    by Husgaard ( 858362 ) on Wednesday December 07, 2005 @04:55PM (#14205042)
    They are trying to efficiently find a signal in random and chaotic data. Random and chaotic data isn't easy to index.
  • Mythbusters (Score:3, Informative)

    by everphilski ( 877346 ) on Wednesday December 07, 2005 @05:10PM (#14205153) Journal
    Mythbusters actually did an episode where they built two different needle-in-a-haystack finding machines; one did quite well...

    -everphilski-
  • by G4from128k ( 686170 ) on Wednesday December 07, 2005 @05:32PM (#14205350)
    Looking for possible patterns in large volumes of data is dangerous because of the high chance that random data will fit some of the myriad patterns tried. If you test data against a thousand possible patterns, then about 50 of them will be found to be present at a statistical significance level of 5%, even if the data is 100% random (the simulation sketched after this comment illustrates this). "Cancer clusters" are an excellent example of this -- if you slice and dice a population enough different ways, you are bound to find some geographic/demographic/ethnographic subgroup with a very high chance of some cancer.


    It's better to have an a priori hypothesis and look for one specific, pre-defined pattern in a mound of data than to see if any pattern at all is in the data. Or, if one insists on looking for many patterns, then the standards for statistical significance must be correspondingly higher.
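
A quick simulation makes the parent's arithmetic concrete. This is a minimal sketch assuming NumPy and SciPy are available; the setup (1,000 one-sample t-tests on pure noise) is hypothetical and not taken from TFA:

```python
# Simulation of the parent's claim: run 1,000 significance tests on
# pure noise and count how many come out "significant" at the 5% level.
# (Illustrative sketch only; numbers and setup are not from TFA.)
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_patterns = 1000   # candidate "patterns" being tested
n_samples = 100     # observations per test

# Pure noise: by construction, no real pattern exists in any row.
data = rng.normal(size=(n_patterns, n_samples))

# One-sample t-test per row for a nonzero mean.
_, p_values = stats.ttest_1samp(data, popmean=0.0, axis=1)

false_positives = int(np.sum(p_values < 0.05))
print(f"{false_positives} of {n_patterns} tests look significant")
# Typically prints a number near 50: about 5% of 1,000, as expected.
```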

  • by zex ( 214881 ) on Wednesday December 07, 2005 @05:54PM (#14205567) Homepage
    If you test data against a thousand possible patterns, then about 50 of them will be found to be present at a statistical significance level of 5% (even if the data is 100% random).


    If you're not correcting for multiple hypothesis testing, you are correct. If you do have 100% random data that holds to perfect randomness at all scales (which I'm not sure is even possible) and correct for multiple hypothesis testing, then you'll find exactly what you "should" find: no significant pattern (a correction of this kind is sketched after this comment).

    You mention "Cancer clusters" as an example of attribution of significance to insignificant findings. However, these clusters are often found (at least in the genetics research realm) by hierarchical clustering, which is self-correcting for multiple hypothesis testing. If you're speaking of demographic surveys which find that (e.g.) "black females in Tahiti who were exposed to .... are more susceptible to brain cancer", then you're probably right. I too see those as examples of restricting the domain of samples until you find a pattern - but the pattern nonetheless exists.
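
For reference, a minimal sketch of the kind of multiple-testing correction the parent describes, using the Bonferroni adjustment. Bonferroni is an assumption on my part; neither the comments nor the summary name a specific procedure:

```python
# Bonferroni correction: keep the family-wise error rate at alpha by
# testing each of the m hypotheses at level alpha / m.
# (Sketch only; Bonferroni is one standard correction, not necessarily
# what the CWRU researchers or genetics pipelines actually use.)
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_patterns, n_samples, alpha = 1000, 100, 0.05

data = rng.normal(size=(n_patterns, n_samples))   # pure noise again
_, p_values = stats.ttest_1samp(data, popmean=0.0, axis=1)

uncorrected = int(np.sum(p_values < alpha))
corrected = int(np.sum(p_values < alpha / n_patterns))  # Bonferroni

print(f"Uncorrected: {uncorrected} 'significant' results")
print(f"Bonferroni-corrected: {corrected} (expected: zero on noise)")
```

Bonferroni is deliberately conservative; less strict procedures such as Benjamini-Hochberg false-discovery-rate control are the usual choice in the genetics settings the parent mentions.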
