Finding a Needle in a Haystack of Data

Posted by ScuttleMonkey on Wednesday December 07, 2005 @04:44PM from the mathematical-sieve dept.

Roland Piquepaille writes "Finding useful information in oceans of data is an increasingly complex problem in many scientific areas. This is why researchers from Case Western Reserve University (CWRU) have created new statistical techniques to isolate useful signals buried in large datasets coming from particle physics experiments, such as the ones run in a particle collider. But their method could also be applied to a broad range of applications, like discovering a new galaxy, monitoring transactions for fraud or identifying the carrier of a virulent disease among millions of people." Case Western has also provided a link to the original paper. [PDF Warning]

Re:9...9...9...9... (Score:5, Insightful)

by flynt ( 248848 ) writes: on Wednesday December 07, 2005 @05:04PM (#14205117)

Whether you "know" or not is always up for debate, but that's usually a topic for epistemology class. In classical hypothesis testing in statistics, you make a distributional assumption about your data (the null hypothesis), and then calculate the probability, given that assumption, of seeing data at least as extreme as what you observed (the p-value). If this probability is very low (and how low counts as "very low" is also an interpretation), you conclude that your initial distributional assumption was incorrect. There are finer points to it of course, but classical hypothesis testing in statistics is pretty much a probabilistic reductio ad absurdum.
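To make the procedure concrete, here is a minimal sketch of an exact two-sided test for a coin. The function name `binomial_p_value` and the example data (9 heads in 10 flips) are illustrative assumptions, not something from the paper or the story; the "initial assumption" is simply that the coin is fair.

```python
from math import comb

def binomial_p_value(heads: int, flips: int, p_heads: float = 0.5) -> float:
    """Exact two-sided p-value for a coin (hypothetical helper).

    Sums the null-hypothesis probability of every outcome that is
    no more likely than the one actually observed.
    """
    def pmf(k: int) -> float:
        return comb(flips, k) * p_heads**k * (1 - p_heads) ** (flips - k)

    observed = pmf(heads)
    # Small relative tolerance so ties survive floating-point rounding.
    return sum(pmf(k) for k in range(flips + 1) if pmf(k) <= observed * (1 + 1e-9))

# Assumed data: 9 heads out of 10 flips, tested against a fair coin.
p = binomial_p_value(heads=9, flips=10)
print(f"p-value = {p:.4f}")
```

A small p-value here doesn't prove the coin is biased; it only says the observed data would be unusual if the fair-coin assumption were true.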

Re:9...9...9...9... (Score:3, Insightful)

by Stonehand ( 71085 ) writes: on Wednesday December 07, 2005 @05:27PM (#14205292) Homepage

Not really.

The more you constrain your allegedly random process, such as by insisting that it produce output without "patterns" -- whatever those are -- the less random it actually is.

To put it in more concrete terms, which is more random -- a coin which flips 50-50 heads/tails with no other constraints whatsoever, or a coin which flips 50-50 but will never, say, flip 100 heads in a row, and will never exactly alternate, and will never produce the bit sequence corresponding to the ASCII encoding of the text of Rissanen's first paper on MDL, and...?

What the OP might want to look into is the notion of incompressibility, and perhaps Kolmogorov complexity. Of course, the latter is incomputable, but that's life.
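Since Kolmogorov complexity can't be computed, one rough stand-in is the size of the data after running it through a general-purpose compressor, which gives only an upper bound on the description length. The sketch below assumes zlib as that compressor; `compressed_ratio` is a hypothetical helper name, and the two inputs are made-up examples.

```python
import os
import zlib

def compressed_ratio(data: bytes) -> float:
    """Compressed size divided by original size (hypothetical helper).

    A ratio far below 1 means zlib found a pattern; a ratio near 1
    means it found none, which does not prove that none exists.
    """
    return len(zlib.compress(data, 9)) / len(data)

patterned = b"9" * 10_000          # the same byte over and over
random_like = os.urandom(10_000)   # bytes from the OS entropy source

print(f"patterned:   {compressed_ratio(patterned):.3f}")
print(f"random-like: {compressed_ratio(random_like):.3f}")
```

Note the asymmetry: a short compressed form shows the data is not random-looking, but failing to compress says nothing more than that this particular compressor didn't find a pattern.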

Re:I don't want to rain on the parade (Score:1, Insightful)

by Anonymous Coward writes: on Wednesday December 07, 2005 @11:30PM (#14207610)

I'm not a physicist, and I haven't had enough time to really look over the paper thoroughly, but I am a statistician.

My reading of the paper is that the Cauchy distribution is mentioned only to partially define a distribution that is used in an example. That is, there is nothing about the Cauchy distribution that is necessary for their results to hold. The Cauchy distribution is only relevant in an example, and only to partly define a density. Note, furthermore, that nowhere in the paper do they discuss the expectation of a Cauchy density, only the expectation of a score statistic. They do mention in the example that the Cauchy density is "centered" at a point E_0, but that's possible, as the central tendency of a Cauchy can be defined by the median of the distribution [wolfram.com].

So you may be right, but I think that their discussion of the Cauchy doesn't detract from the rest of the paper.
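The point about the median can be checked numerically. This is only a sketch: the location `e0 = 1.0`, the scale `0.1`, and the helper name `cauchy_sample` are assumptions chosen for illustration, not values taken from the paper.

```python
import math
import random
import statistics

def cauchy_sample(n, location, scale, rng):
    """Draw n Cauchy variates by inverting the CDF (hypothetical helper)."""
    return [
        location + scale * math.tan(math.pi * (rng.random() - 0.5))
        for _ in range(n)
    ]

rng = random.Random(0)  # fixed seed so the run is repeatable
e0 = 1.0                # assumed centre of the density
draws = cauchy_sample(100_000, location=e0, scale=0.1, rng=rng)

# The median settles near e0 as the sample grows.
sample_median = statistics.median(draws)
# The mean does not converge: a few huge draws can dominate it.
sample_mean = statistics.fmean(draws)

print(f"median = {sample_median:.4f}, mean = {sample_mean:.4f}")
```

Rerunning with different seeds should leave the median close to `e0` while the mean wanders, which is the sense in which a Cauchy can be "centered" without having an expectation.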

Re:I don't want to rain on the parade (Score:3, Insightful)

by martin-boundary ( 547041 ) writes: on Thursday December 08, 2005 @12:34AM (#14207934)

That's a good point. In the paper, formula (2) is finite only if the tails of f dominate the tails of psi, so that means that f would have to be at least as fat-tailed as the Cauchy. However, the paper doesn't attempt to state any assumptions, so it's hard to see which parts are solid and where there might be handwaving.
Funnily enough, the density f they use in the Monte Carlo simulation appears to be truncated to be in the interval [0,2] (otherwise it wouldn't be integrable). That suggests that in practice, they really do everything on the interval [0,2], and the psi they present isn't really a Cauchy in the first place.
Oh well, rigour still isn't a strong point of physics ;)
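To illustrate why working on a bounded interval changes things, here is a sketch that samples a Cauchy-shaped density restricted to [0,2]. It is not the paper's simulation: the centre `1.0`, the scale `0.1`, and the helper name `truncated_cauchy` are assumptions, and only the interval [0,2] comes from the discussion above.

```python
import math
import random
import statistics

def truncated_cauchy(n, location, scale, low, high, rng):
    """Sample a Cauchy restricted to [low, high] (hypothetical helper).

    Uniform draws are mapped into the slice of the CDF that lies
    between the two endpoints, then pushed through the inverse CDF.
    """
    def cdf(x):
        return 0.5 + math.atan((x - location) / scale) / math.pi

    u_low, u_high = cdf(low), cdf(high)
    return [
        location + scale * math.tan(
            math.pi * (u_low + (u_high - u_low) * rng.random() - 0.5)
        )
        for _ in range(n)
    ]

rng = random.Random(0)  # fixed seed so the run is repeatable
draws = truncated_cauchy(100_000, location=1.0, scale=0.1,
                         low=0.0, high=2.0, rng=rng)

# With the tails cut off, every moment is finite and the mean is stable.
truncated_mean = statistics.fmean(draws)
print(f"mean of truncated sample = {truncated_mean:.4f}")
```

Once the tails are cut off, the usual moment-based arguments apply again, which fits the suggestion that the truncated density is no longer really a Cauchy.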

Re:I don't want to rain on the parade (Score:2, Insightful)

by jmtpi ( 17834 ) writes: on Thursday December 08, 2005 @03:17AM (#14208532) Homepage

martin-boundary wrote:
If you download the linked paper, on the second page they talk about the Breit-Wigner (Cauchy) density psi, and later they claim that their score process has zero expectation. Now, everyone knows that the Breit-Wigner does not *have* an expectation, and it's often used as an example where the asymptotic normal (Gaussian) distribution approximation doesn't hold. But still, they derive all sorts of distribution formulas involving a chi squared and a Gaussian process, as if there was no problem at all with the Breit-Wigner tails.

They use a Breit-Wigner because that's often a realistic model of the signal distribution, when one is talking about resonance production in a particle physics experiment. (My copy is at work, but I know this is discussed, for example, in Sakurai's Modern Quantum Mechanics.) I don't think this paper nearly lived up to the press release, and certainly isn't germane to Slashdot, but I don't think the use of a BW has anything to do with it.
On the other hand, I'm merely a particle physics grad student, and I didn't even attempt to read the center of the paper. If they really did come up with something that has more power than chi^2 (at least for an extremely simple fit) then that is notable. What would be really interesting would be for someone to come up with a real goodness-of-fit statistic for unbinned fits.
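For readers unfamiliar with the chi^2 being compared against, a minimal sketch of the binned Pearson statistic follows. The bin counts are made up for illustration and `pearson_chi2` is a hypothetical helper name; nothing here reproduces the paper's method.

```python
def pearson_chi2(observed, expected):
    """Pearson chi-squared statistic for binned counts (hypothetical helper)."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected need the same number of bins")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Assumed example: five bins, a flat background expectation of 20 per bin,
# and a modest excess in the middle bin.
observed = [18, 22, 30, 20, 10]
expected = [20.0] * 5

chi2 = pearson_chi2(observed, expected)
print(f"chi^2 = {chi2:.2f} over {len(observed)} bins")
```

The statistic depends on how the data are binned, which is part of why a goodness-of-fit measure for unbinned fits would be attractive.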
