Math Technology

Finding a Needle in a Haystack of Data

Roland Piquepaille writes "Finding useful information in oceans of data is an increasingly complex problem in many scientific areas. This is why researchers from Case Western Reserve University (CWRU) have created new statistical techniques to isolate useful signals buried in large datasets coming from particle physics experiments, such as the ones run in a particle collider. But their method could also be applied to a broad range of applications, like discovering a new galaxy, monitoring transactions for fraud or identifying the carrier of a virulent disease among millions of people." Case Western has also provided a link to the original paper. [PDF Warning]

  • by flynt ( 248848 ) on Wednesday December 07, 2005 @05:04PM (#14205117)
    Whether you "know" or not is always up for debate, but that's usually for epistemology class. In classical hypothesis testing in statistics, you make a distributional assumption about your data (the null hypothesis), and then calculate the probability, given that assumption, of seeing data at least as extreme as what you observed (the p-value). If this probability is very low (what counts as "low" is itself a judgment call), you reject your initial distributional assumption. There are finer points to it of course, but classical hypothesis testing in statistics is pretty much a reductio ad absurdum in logic.
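The procedure described above can be sketched with a coin-flip example. This is only an illustration: the function name `binomial_p_value`, the fair-coin null hypothesis, and the 60-heads-in-100-flips data are assumptions made up for the sketch, not anything taken from the paper.

```python
from math import comb

def binomial_p_value(heads: int, flips: int, p_heads: float = 0.5) -> float:
    """Exact two-sided p-value for a coin assumed (null hypothesis) to land
    heads with probability p_heads: the total probability of every outcome
    that is no more likely than the one actually observed."""
    def pmf(k: int) -> float:
        return comb(flips, k) * p_heads**k * (1 - p_heads) ** (flips - k)

    observed = pmf(heads)
    # The small relative tolerance keeps floating-point ties on the "as extreme" side.
    return sum(pmf(k) for k in range(flips + 1) if pmf(k) <= observed * (1 + 1e-9))

# Hypothetical data: 60 heads in 100 flips of a supposedly fair coin.
p = binomial_p_value(heads=60, flips=100)
print(f"p-value = {p:.4f}")
```

A small p-value here plays the role of the "absurdity" in the reductio: either the fair-coin assumption is wrong or something improbable happened. Where to draw the line (0.05, 0.01, or the much stricter thresholds used in particle physics) is the judgment call mentioned in the comment.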
  • by Stonehand ( 71085 ) on Wednesday December 07, 2005 @05:27PM (#14205292) Homepage
    Not really.

    The more you constrain your allegedly random process, such as by insisting that it produce output without "patterns" -- whatever those are -- the less random it actually is.

    To put it in more concrete terms, which is more random -- a coin which flips 50-50 heads/tails with no other constraints whatsoever, or a coin which flips 50-50 but will never, say, flip 100 heads in a row, and will never exactly alternate, and will never produce the bit sequence corresponding to the ASCII encoding of the text of Rissanen's first paper on MDL, and... ?

    What the OP might want to look into is the notion of incompressibility, and perhaps Kolmogorov complexity. Of course, the latter is incomputable, but that's life.
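Since Kolmogorov complexity itself cannot be computed, a common stand-in is the output size of an ordinary compressor, which only gives an upper bound. The sketch below uses zlib for that purpose; the helper name `compressed_ratio` and the three test sequences are assumptions chosen for illustration.

```python
import os
import zlib

def compressed_ratio(data: bytes) -> float:
    """Compressed size divided by original size. A ratio near 1.0 means zlib
    found no structure to exploit; it does not prove that none exists."""
    return len(zlib.compress(data, 9)) / len(data)

n = 100_000
random_flips = os.urandom(n)        # OS-provided random bytes
alternating = b"HT" * (n // 2)      # a coin that exactly alternates
all_heads = b"H" * n                # a very long run of heads

for name, seq in [("random", random_flips),
                  ("alternating", alternating),
                  ("all heads", all_heads)]:
    print(f"{name:12s} ratio = {compressed_ratio(seq):.4f}")
```

The patterned sequences shrink to a tiny fraction of their length, while the random bytes do not shrink at all. Note the asymmetry: a good compression ratio demonstrates structure, but a poor one only says that this particular compressor failed to find any.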
  • by Anonymous Coward on Wednesday December 07, 2005 @11:30PM (#14207610)
    I'm not a physicist, and I haven't had enough time to really look over the paper thoroughly, but I am a statistician.

    My reading of the paper is that the Cauchy distribution is mentioned only to partially define a distribution that is used in an example. That is, there is nothing about the Cauchy distribution that is necessary for their results to hold. The Cauchy distribution is only relevant in an example, and only to partly define a density. Note, furthermore, that nowhere in the paper do they discuss the expectation of a Cauchy density, only the expectation of a score statistic. They do mention in the example that the Cauchy density is "centered" at a point E_0, but that's possible, as the central tendency of a Cauchy can be defined by the median of the distribution [wolfram.com].

    So you may be right, but I think that their discussion of the Cauchy doesn't detract from the rest of the paper.
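The point about the median can be checked numerically. The sketch below draws Cauchy (Breit-Wigner) samples by inverting the CDF and compares the sample median with the sample mean; the center `e0`, half-width `gamma`, seed, and sample sizes are illustrative assumptions, not values from the paper.

```python
import math
import random
import statistics

def sample_breit_wigner(n, e0, gamma, rng):
    """Draw n samples from a Cauchy (Breit-Wigner) density centered at e0
    with half-width gamma, by applying the inverse CDF to uniform draws."""
    return [e0 + gamma * math.tan(math.pi * (rng.random() - 0.5)) for _ in range(n)]

rng = random.Random(2005)   # fixed seed so the run is repeatable
e0, gamma = 1.0, 0.1        # illustrative values only

for n in (1_000, 10_000, 100_000):
    xs = sample_breit_wigner(n, e0, gamma, rng)
    print(n,
          "median:", round(statistics.median(xs), 4),
          "mean:", round(statistics.fmean(xs), 4))
```

The sample median settles close to `e0` as `n` grows, which is the sense in which the density is "centered" there. The sample mean has no such guarantee, because a single extreme draw from the heavy tails can move it arbitrarily far.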
  • by martin-boundary ( 547041 ) on Thursday December 08, 2005 @12:34AM (#14207934)
    That's a good point. In the paper, the formula (2) is finite only if the tails of f dominate the tails of psi, so that means that f would have to be at least as fat tailed as the Cauchy. However, the paper doesn't attempt to state any assumptions, so it's hard to see which parts are solid and where there might be handwaving.

    Funnily enough, the density f they use in the Monte Carlo simulation appears to be truncated to the interval [0,2] (otherwise it wouldn't be integrable). That suggests that in practice, they really do everything on the interval [0,2], and the psi they present isn't really a Cauchy in the first place.

    Oh well, rigour still isn't a strong point of physics ;)
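To make the truncation remark concrete, here is a short derivation sketch. It assumes a Breit-Wigner shape with center E_0 and half-width \gamma, restricted to an interval [a, b] such as [0, 2]; the symbols \gamma, Z, a, and b are introduced here for illustration and are not taken from the paper.

```latex
% Truncated Breit-Wigner density on [a, b]
\[
\psi_{[a,b]}(x) = \frac{1}{Z}\,\frac{\gamma}{(x-E_0)^2+\gamma^2},
\quad a \le x \le b,
\qquad
Z = \arctan\frac{b-E_0}{\gamma} - \arctan\frac{a-E_0}{\gamma}.
\]
% Its mean is finite, because the integral runs over a bounded interval
\[
\mathbb{E}[X]
= E_0 + \frac{1}{Z}\int_a^b \frac{(x-E_0)\,\gamma}{(x-E_0)^2+\gamma^2}\,dx
= E_0 + \frac{\gamma}{2Z}\,
  \ln\frac{(b-E_0)^2+\gamma^2}{(a-E_0)^2+\gamma^2}.
\]
```

On the whole real line the same integrand decays only like 1/|x|, so the integral diverges logarithmically and the expectation does not exist. Once the support is bounded, every moment is finite, which is why working on [0,2] sidesteps the usual objection to Cauchy tails.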

  • by jmtpi ( 17834 ) on Thursday December 08, 2005 @03:17AM (#14208532) Homepage
    martin-boundary wrote:
    If you download the linked paper, on the second page they talk about the Breit-Wigner (Cauchy) density psi, and later they claim that their score process has zero expectation. Now, everyone knows that the Breit-Wigner does not *have* an expectation, and it's often used as an example where the asymptotic normal (Gaussian) distribution approximation doesn't hold. But still, they derive all sorts of distribution formulas involving a chi squared and a Gaussian process, as if there was no problem at all with the Breit-Wigner tails.

    They use a Breit-Wigner because that's often a realistic model of the signal distribution, when one is talking about resonance production in a particle physics experiment. (My copy is at work, but I know this is discussed, for example, in Sakurai's Modern Quantum Mechanics.) I don't think this paper nearly lived up to the press release, and certainly isn't germane to Slashdot, but I don't think the use of a BW has anything to do with it.

    On the other hand, I'm merely a particle physics grad student, and I didn't even attempt to read the center of the paper. If they really did come up with something that has more power than chi^2 (at least for an extremely simple fit) then that is notable. What would be really interesting would be for someone to come up with a real goodness-of-fit statistic for unbinned fits.
