Finding a Needle in a Haystack of Data
Roland Piquepaille writes "Finding useful information in oceans of data is an increasingly complex problem in many scientific areas. This is why researchers from Case Western Reserve University (CWRU) have created new statistical techniques to isolate useful signals buried in large datasets coming from particle physics experiments, such as the ones run in a particle collider. But their method could also be applied to a broad range of applications, like discovering a new galaxy, monitoring transactions for fraud or identifying the carrier of a virulent disease among millions of people." Case Western has also provided a link to the original paper. [PDF Warning]
Re:9...9...9...9... (Score:5, Insightful)
Re:9...9...9...9... (Score:3, Insightful)
The more you constrain your allegedly random process, such as by insisting that it produce output without "patterns" -- whatever those are -- the less random it actually is.
To put it in more concrete terms, which is more random -- a coin which flips 50-50 heads/tails with no other constraints whatsoever, or a coin which flips 50-50 but will never, say, flip 100 heads in a row, and will never exactly alternate, and will never produce the bit sequence corresponding to the ASCII encoding of the text of Rissanen's first paper on MDL, and... ?
What the OP might want to look into is the notion of incompressibility, and perhaps Kolmogorov complexity. Of course, the latter is uncomputable, but that's life.
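Since Kolmogorov complexity itself can't be computed, a common stand-in is the output size of a general-purpose compressor, which only gives a rough upper bound. The sketch below is an illustration of that idea, not anything from the paper; `compressed_ratio` is a made-up helper name and zlib is just one convenient compressor.

```python
import os
import zlib


def compressed_ratio(data: bytes) -> float:
    """Return compressed size divided by original size (zlib, level 9)."""
    return len(zlib.compress(data, 9)) / len(data)


n = 100_000
patterned = b"HT" * (n // 2)   # a "coin" that exactly alternates
random_like = os.urandom(n)     # bytes from the OS entropy source

ratio_patterned = compressed_ratio(patterned)
ratio_random = compressed_ratio(random_like)

# The alternating sequence shrinks to a tiny fraction of its size;
# the random bytes do not shrink at all (zlib adds a little overhead).
print(f"patterned: {ratio_patterned:.4f}  random: {ratio_random:.4f}")
```

A low ratio shows that a particular sequence has structure; it says nothing about whether the process that produced it was random, which is the point made above about unconstrained coins.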
Re:I don't want to rain on the parade (Score:1, Insightful)
My reading of the paper is that the Cauchy distribution is mentioned only to partially define a density used in an example; nothing about the Cauchy distribution is necessary for their results to hold. Note, furthermore, that nowhere in the paper do they discuss the expectation of a Cauchy density, only the expectation of a score statistic. They do mention in the example that the Cauchy density is "centered" at a point E_0, but that's legitimate, since the central tendency of a Cauchy can be defined by the median of the distribution [wolfram.com].
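To see why "centered at E_0" still makes sense for a Cauchy, here is a minimal simulation sketch. The values `x0 = 1.0` and `gamma = 0.2` are arbitrary assumptions for illustration, not parameters taken from the paper, and `sample_cauchy` is a hypothetical helper.

```python
import math
import random
import statistics


def sample_cauchy(rng: random.Random, x0: float, gamma: float) -> float:
    """Draw one Cauchy(x0, gamma) variate by inverting the CDF."""
    u = rng.random()
    return x0 + gamma * math.tan(math.pi * (u - 0.5))


rng = random.Random(12345)   # fixed seed so the run is repeatable
x0, gamma = 1.0, 0.2         # assumed location and scale
samples = [sample_cauchy(rng, x0, gamma) for _ in range(100_001)]

sample_median = statistics.median(samples)
sample_mean = statistics.fmean(samples)

# The median settles near x0; the mean is dominated by a few huge draws
# and does not converge as the sample grows.
print(f"median: {sample_median:.4f}  mean: {sample_mean:.4f}")
```

Rerunning with different seeds should leave the median close to `x0` while the mean jumps around, which is the practical face of the expectation being undefined.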
So you may be right, but I think that their discussion of the Cauchy doesn't detract from the rest of the paper.
Re:I don't want to rain on the parade (Score:3, Insightful)
Funnily enough, the density f they use in the Monte Carlo simulation appears to be truncated to the interval [0,2] (otherwise it wouldn't be integrable). That suggests that in practice, they really do everything on the interval [0,2], and the psi they present isn't really a Cauchy in the first place.
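For what truncation does to a Cauchy-shaped peak, here is a rough sketch that restricts a Cauchy to [0,2] and samples from it by inverse CDF. This is not the paper's f, whose full form isn't reproduced here; the location 1.0, the scale 0.2 and the helper name `truncated_cauchy_sampler` are all assumptions.

```python
import math
import random


def truncated_cauchy_sampler(x0, gamma, lo, hi):
    """Return (mass, draw) for a Cauchy(x0, gamma) restricted to [lo, hi].

    mass is the probability the untruncated Cauchy assigns to [lo, hi];
    dividing the Cauchy density by it renormalises the truncated density.
    """
    a = math.atan((lo - x0) / gamma)
    b = math.atan((hi - x0) / gamma)
    mass = (b - a) / math.pi

    def draw(rng):
        # Inverse-CDF sampling: uniform in the angle, then map back with tan.
        theta = a + rng.random() * (b - a)
        return x0 + gamma * math.tan(theta)

    return mass, draw


rng = random.Random(2024)
mass, draw = truncated_cauchy_sampler(x0=1.0, gamma=0.2, lo=0.0, hi=2.0)
samples = [draw(rng) for _ in range(50_000)]
sample_mean = sum(samples) / len(samples)

print(f"mass on [0,2]: {mass:.4f}  sample mean: {sample_mean:.4f}")
```

Once the support is bounded, every moment is finite, so the truncated density has an ordinary mean and variance even though the untruncated Cauchy does not.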
Oh well, rigour still isn't a strong point of physics ;)
Re:I don't want to rain on the parade (Score:2, Insightful)
They use a Breit-Wigner (BW) because that's often a realistic model of the signal distribution when one is talking about resonance production in a particle physics experiment. (My copy is at work, but I know this is discussed, for example, in Sakurai's Modern Quantum Mechanics.) I don't think this paper came close to living up to the press release, and it certainly isn't germane to Slashdot, but I don't think the use of a BW has anything to do with that.
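For reference, the non-relativistic Breit-Wigner line shape is the same function as the Cauchy (Lorentzian) density discussed earlier in the thread. In the form below, E_0 is the resonance position; the width symbol \Gamma (full width at half maximum) is standard notation but is my assumption about how the paper labels it.

```latex
f(E) \;=\; \frac{1}{\pi}\,
  \frac{\Gamma/2}{(E - E_0)^2 + (\Gamma/2)^2},
\qquad
\int_{-\infty}^{\infty} f(E)\,dE = 1 .
```

This is a Cauchy density with location E_0 and scale \gamma = \Gamma/2, so its median is E_0 while its mean is undefined.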
On the other hand, I'm merely a particle physics grad student, and I didn't even attempt to read the center of the paper. If they really did come up with something that has more power than chi^2 (at least for an extremely simple fit) then that is notable. What would be really interesting would be for someone to come up with a real goodness-of-fit statistic for unbinned fits.