Finding a Needle in a Haystack of Data

Roland Piquepaille writes "Finding useful information in oceans of data is an increasingly complex problem in many scientific areas. This is why researchers from Case Western Reserve University (CWRU) have created new statistical techniques to isolate useful signals buried in large datasets coming from particle physics experiments, such as the ones run in a particle collider. But their method could also be applied to a broad range of applications, like discovering a new galaxy, monitoring transactions for fraud or identifying the carrier of a virulent disease among millions of people." Case Western has also provided a link to the original paper. [PDF Warning]
Comments Filter:
  • Google (Score:4, Interesting)

    by biocute ( 936687 ) on Wednesday December 07, 2005 @04:45PM (#14204954)
    Does Google have the technology to do this kind of scientific search yet?

    If it does, it sure can save these researchers a lot of time; if it doesn't, I'm sure Google will be keen to get involved, especially on the "isolate useful signals buried in large datasets" part.
  • by Billosaur ( 927319 ) * <<wgrother> <at> <optonline.net>> on Wednesday December 07, 2005 @04:48PM (#14204983) Journal

    I see this as being a boon to SETI [berkeley.edu]. If there was ever a needle in a haystack, it's trying to tease a possible intelligent signal out of the cosmic background noise. If you have an idea what the background is like in general, then it's far easier to detect an abnormality in that background noise. The question will end up being, are we simply detecting more false positives or are these real signals?
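The parent's point, that a known background makes anomalies easy to spot, can be illustrated with a minimal sketch (all numbers invented): inject a strong excess into one bin of simulated Gaussian noise and pick out the bin with the largest z-score against the assumed background model.

```python
import random

random.seed(42)

N_BINS = 1000
SIGNAL_BIN = 137          # arbitrary bin where we inject a "signal"
SIGNAL_STRENGTH = 10.0    # a 10-sigma excess over the background

# Simulated background: zero-mean, unit-variance Gaussian noise per bin.
spectrum = [random.gauss(0.0, 1.0) for _ in range(N_BINS)]
spectrum[SIGNAL_BIN] += SIGNAL_STRENGTH

# Known background model (mean 0, sigma 1), so the z-score is just the
# bin value itself; with an unknown background we would first have to
# estimate the mean and sigma from the data.
z_scores = [abs(x) for x in spectrum]
candidate = max(range(N_BINS), key=lambda i: z_scores[i])

print(candidate)
```

With ~1000 bins of unit Gaussian noise the largest noise fluctuation is typically only 3-4 sigma, so a 10-sigma injection dominates; the hard part in practice (and the source of the false positives the parent worries about) is that real backgrounds are neither known exactly nor Gaussian.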

  • by tomzyk ( 158497 ) on Wednesday December 07, 2005 @04:57PM (#14205063) Journal
    FYI: Its abbreviation is not "CWRU" anymore. As of about 2 years ago, they changed it to simply "Case" and gave it the silly new logo of 2 paperclips stuck together.

    Why? I have no idea. Some "university branding" thing that some people thought was important to the growth of the campus or something. Apparently it ticked off a bunch of alumni (from the original Western Reserve University) too.

    Knowing is half the battle.
  • by ahmusch ( 777177 ) on Wednesday December 07, 2005 @05:33PM (#14205359)
    Current fraud detection systems in use in the financial industry are based on two primary knowledge bases:

    1. A knowledge of your purchasing pattern as a consumer. To wit, having a statistically significant sample of what are valid transactions as well as knowing your credit score and income.

    Do you shop at high-end stores? Do you use your card for primarily travel and entertainment? Do you use your card for everyday purchases? How much of your line-of-credit do you tend to use?

    2. A comparison of recent transactions. For example:

    A sudden wave of big-ticket purchases very close together in time, such as hitting a Best Buy the same day as buying jewelry.

    A single card making multiple high-value transactions (3 or more) within an hour.

    A pattern alternating between unattended-auth transactions (think pay-at-the-pump) and big-ticket purchases.

    Geometric statistical analysis could only complement pattern analysis in any case, and I fail to see how it's superior to the existing behavior-scoring algorithms, which weight each new transaction against an individual's past history to determine whether it's "out of profile" and, if so, by what margin. Sometimes the fraud is only revealed by several transactions scoring progressively higher on the fraud-o-meter, and I suspect the geometric statistical analysis would fail to flag that as an event, since each transaction would look like a continuation of the pattern.

    My ability to read statistics papers is sadly out of date. Anyone want to give a shot at translating this into non-doctoral English?
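The behavior-scoring idea the parent describes can be sketched in a few lines. This is a toy model, not a real fraud engine: the function name, the transaction fields, and every threshold below are invented for illustration.

```python
from datetime import datetime, timedelta

def profile_score(history, amount, merchant_type, when):
    """Score one transaction against the cardholder's own history.

    history: list of dicts with "amount", "when", "merchant_type".
    Higher scores mean further "out of profile". All weights invented.
    """
    score = 0.0

    # Rule 1: far above this cardholder's typical spend.
    amounts = [t["amount"] for t in history]
    if amounts and amount > 3 * (sum(amounts) / len(amounts)):
        score += 2.0

    # Rule 2: burst of activity, several transactions within an hour.
    recent = [t for t in history if when - t["when"] < timedelta(hours=1)]
    if len(recent) >= 3:
        score += 2.0

    # Rule 3: unattended-auth following a recent big-ticket purchase
    # (the pump-to-big-ticket-and-back pattern).
    if merchant_type == "unattended" and any(
            t["merchant_type"] == "big_ticket" for t in recent):
        score += 1.5

    return score

# Usage: a $500 charge against a history of ~$20 grocery purchases.
t0 = datetime(2005, 12, 7, 12, 0)
history = [{"amount": 20.0, "when": t0 - timedelta(days=d),
            "merchant_type": "grocery"} for d in range(1, 11)]
print(profile_score(history, 500.0, "big_ticket", t0))  # → 2.0
```

Note how the score accumulates per rule: as the parent says, the signal is often several transactions scoring progressively higher, not any single one crossing a line.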
  • by Lord Byron II ( 671689 ) on Wednesday December 07, 2005 @06:48PM (#14205962)
    As a particle physicist I know exactly the kind of challenge that this is. The SNR is horrible, you've got tons of data, and the data may be distorted by all sorts of sources (background, misalignment, the wrong reaction, etc).

    I also know that these sorts of algorithms are created all the time. In fact, someone in my lab got his Ph.D. for applying a neural network to this problem. Furthermore, these algorithms are not "plug-n-play": they must be manually tuned by a team with deep knowledge of the system in order to be useful.

    So trust me when I say that Roland has blown this out of proportion. Congratulations to the CWRU team for getting the PRL paper published, but this is hardly the kind of ground-breaking news that deserves to be on Slashdot.

  • by martin-boundary ( 547041 ) on Wednesday December 07, 2005 @09:27PM (#14206837)
    I don't want to rain on the parade, but the result is quite possibly wrong.

    If you download the linked paper, on the second page they discuss the Breit-Wigner (Cauchy) density psi, and later they claim that their score process has zero expectation. Now, everyone knows that the Breit-Wigner does not *have* an expectation; it's often used as the standard example where the asymptotic normal (Gaussian) approximation doesn't hold. But they still derive all sorts of distribution formulas involving a chi-squared and a Gaussian process, as if there were no problem at all with the Breit-Wigner tails.

    I think their derivation is quite possibly wrong.
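The parent's objection is easy to check numerically. The standard Cauchy (Breit-Wigner) density is psi(x) = 1/(pi*(1+x^2)), and its first-moment integral over [0, T] has the closed form ln(1+T^2)/(2*pi), which grows without bound as T increases, so the expectation does not exist. A quick midpoint-rule sketch:

```python
import math

def truncated_first_moment(T, steps=200000):
    """Integrate x / (pi * (1 + x^2)) from 0 to T by the midpoint rule."""
    h = T / steps
    total = 0.0
    for i in range(steps):
        x = (i + 0.5) * h
        total += x / (math.pi * (1.0 + x * x)) * h
    return total

# Compare against the closed form ln(1 + T^2) / (2*pi): the truncated
# integral keeps growing logarithmically instead of converging.
for T in (10, 100, 1000):
    print(T, truncated_first_moment(T), math.log(1 + T * T) / (2 * math.pi))
```

Whether this sinks the paper's derivation is another matter; the score process could still have well-defined moments if the problematic tails cancel in the particular statistic they construct, but that would need to be argued, not assumed.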
