Finding a Needle in a Haystack of Data

Posted by ScuttleMonkey on Wednesday December 07, 2005 @04:44PM from the mathematical-sieve dept.

Roland Piquepaille writes "Finding useful information in oceans of data is an increasingly complex problem in many scientific areas. This is why researchers from Case Western Reserve University (CWRU) have created new statistical techniques to isolate useful signals buried in large datasets coming from particle physics experiments, such as the ones run in a particle collider. But their method could also be applied to a broad range of applications, like discovering a new galaxy, monitoring transactions for fraud or identifying the carrier of a virulent disease among millions of people." Case Western has also provided a link to the original paper. [PDF Warning]

This discussion has been archived. No new comments can be posted.

Google (Score:4, Interesting)

by biocute ( 936687 ) writes: on Wednesday December 07, 2005 @04:45PM (#14204954)

Does Google have the technology to do this kind of scientific searches yet?

If it does, it sure could save these researchers a lot of time; if it doesn't, I'm sure Google will be keen to get involved, especially on the "isolate useful signals buried in large datasets" part.

- Re:Google (Score:1)
  
  by paulsgre ( 890463 ) writes:
  
  But can it find potential girlfriends for Slashdotters? Now that's what I would call isolating a useful (and rare) signal buried in a large dataset. When I see THOSE results, I will be impressed.
  And if so, I've got a useful signal that could use some burying...
  - Re:Google (Score:3, Funny)
    
    by sapped ( 208174 ) writes:
    
    But can it find potential girlfriends for Slashdotters?
    
    Wow. There really aren't any out there. Check it out on google [google.com] yourselves.
    
    The same results come back in images, groups, news, etc. Man. What a sad bunch.
    - Re:Google (Score:2, Funny)
      
      by zopf ( 897522 ) writes:
      
      Bah! Not even in Froogle... ;)
  - Re:Google (Score:2)
    
    by X0563511 ( 793323 ) * writes:
    
    It's not that hard. Go outside. Talk to people. Listen (that's the important part).
    
    Give it a few months and you will be surprised!
    
    (Of course, get yourself in shape - not too hard. An hour 5 days out of the week on a cycle or treadmill will do the trick. Lay off the sugary and fatty snacks.)
    
    I've even had propositions! THEY came to ME!
- Re:Google (Score:3, Funny)
  
  by garcia ( 6573 ) writes:
  
  Does Google have the technology to do this kind of scientific searches yet?
  
  It's only in Beta thus it's not useful ;-)
Was it just me or was this story broken at first? (Score:1)

by Wisgary ( 799898 ) writes:

It just refused to load for me.
- Re:Was it just me or was this story broken at firs (Score:4, Funny)
  
  by MarkGriz ( 520778 ) writes: on Wednesday December 07, 2005 @05:07PM (#14205132)
  
  "It just refused to load for me."
  
  Maybe your interest in the story was deemed statistically insignificant.
  
  - Re:Was it just me or was this story broken at firs (Score:1)
    
    by Wisgary ( 799898 ) writes:
    
    So... I'm just another piece of hay... :(
Indexes (Score:1)

by CastrTroy ( 595695 ) writes:

All you have to do is index it properly, and lots of data can be searched really fast.
- Re:Indexes (Score:2, Informative)
  
  by Husgaard ( 858362 ) writes:
  
  They are trying to efficiently find a signal in random and chaotic data. Random and chaotic data isn't easy to index.
  - Re:Indexes (Score:2)
    
    by CastrTroy ( 595695 ) writes:
    
    But that's the trick. Finding a good way to index the data.
    - Re:Indexes (Score:2)
      
      by SatanicPuppy ( 611928 ) writes:
      
      I don't see that there would be any point in indexing it... In an index you're atomizing it down to its individual meaningless parts. Each and every part is therefore solitary in an index, and cannot be related to any other part of the index in a meaningful way, because all the other parts are equally unrelated to anything and meaningless as well.
      
      It would be more useful to transform the apparently random data in some way so as to make signals or discrepancies buried in it obvious. There are all kinds of fun
- Re:Indexes (Score:1)
  
  by Marko DeBeeste ( 761376 ) writes:
  
  If we had ham, we could have ham and eggs. If we had eggs.
The most obvious application (Score:5, Interesting)

by Billosaur ( 927319 ) * writes: <wgrother AT optonline DOT net> on Wednesday December 07, 2005 @04:48PM (#14204983) Journal

I see this as being a boon to SETI [berkeley.edu]. If there was ever a needle in a haystack, it's trying to tease a possible intelligent signal out of the cosmic background noise. If you have an idea what the background is like in general, then it's far easier to detect an abnormality in that background noise. The question will end up being, are we simply detecting more false positives or are these real signals?
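The idea in the comment above (characterize the background first, then look for departures from it) can be sketched in a few lines of Python. This is only a toy illustration: `flag_outliers`, the sample readings, and the 5-sigma cutoff `k` are assumptions made for the example, not anything taken from SETI or from the Case paper.

```python
from statistics import mean, stdev


def flag_outliers(background, observations, k=5.0):
    """Return indices of observations lying more than k background
    standard deviations away from the background mean."""
    mu = mean(background)
    sigma = stdev(background)
    return [i for i, x in enumerate(observations) if abs(x - mu) > k * sigma]


# Hypothetical readings: a quiet reference stretch, then new samples.
background = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 9.7, 10.3]
observations = [10.1, 9.9, 14.0, 10.2]

print(flag_outliers(background, observations))  # prints [2]
```

Lowering `k` flags more samples, which is exactly the false-positive trade-off the comment asks about.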

- Seti (Score:2)
  
  by jurt1235 ( 834677 ) writes:
  
  Also the first "useful" application for this kind of technique that popped into my head. Actually, the process in my head that made this one item pop up is maybe useful too (-: Lot of random data, and this one is being associated with the article.
- Re:The most obvious application (way OT) (Score:2)
  
  by blackcoot ( 124938 ) writes:
  
  it's been a while since i last did much perl, but shouldn't the last line of your sig be:
  
  ($world = $world) =~ s/bad/good/g;
  
  otherwise you're making your world better but not ever doing anything with it...
  - Re:The most obvious application (way OT) (Score:2)
    
    by geekd ( 14774 ) writes:
    
    Well, he's making HIS world better. Apparently, he can give a crap about the rest of us.
- Re:The most obvious application (Score:2)
  
  by SETIGuy ( 33768 ) writes:
  
  I see this as being a boon to SETI.
  I'll read the article tonight and find out if it's applicable and whether it's better than what we are using. In the SETI@home client processing we already take into account the anticipated form of the signal, so I'm not sure this buys us anything. In fact, other than the exact mathematical description as a multidimensional manifold the text makes it appear that we're already using this technique in our searches for repeated pulses and signals matching the Gaussian pr
- - Re:The most obvious application (Score:2)
    
    by gibodean ( 224873 ) writes:
    
    "Not to say there isn't ET life out there, but until some evidence points to it, we may as well assume that the universe is empty except for us."
    
    Hmm, well how would you go about testing that hypothesis that there's no life out there ? Or that there is life out there ?
    
    Well, you could look into space !!!!!
    
    Gee, wouldn't it be good if someone was doing that to provide the evidence one way or another ?
    
    I've got no problem with Seti. I think they're going to fail to detect intelligent life, but as long as their ai
Ya' know... (Score:3, Funny)

by jacobcaz ( 91509 ) writes: on Wednesday December 07, 2005 @04:49PM (#14204988) Homepage

82.67% of all statistics are made up anyway...

- Re:Ya' know... (Score:2, Funny)
  
  by saskboy ( 600063 ) writes:
  
  "82.67% of all statistics are made up anyway..."
  
  Well yeah, 50% of all statisticians finished in the bottom half of their class.
  - Re:Ya' know... (Score:2)
    
    by Tony Hoyle ( 11698 ) writes:
    
    Not necessarily... only works if there are an even number of statisticians, and if nobody scored the mean score.
    
    e.g. if there are 100 statisticians, the mean score is 37 and 10 statisticians scored that, only 45% of statisticians are technically in the bottom half (and 45% in the top half). 10% are exactly in the middle.
    
    You could say that the 10% are in both the bottom and top half... in which case 55% are in the bottom half and 55% are in the top half!!
    - Re:Ya' know... (Score:3, Funny)
      
      by $RANDOMLUSER ( 804576 ) writes:
      
      Jeez. How anal. You should take some time and count the flowers.
    - Re:Ya' know... (Score:2)
      
      by saskboy ( 600063 ) writes:
      
      "You could say that the 10% are in both the bottom and top half..."
      
      I'm not sure now if a comment like that puts you in the top half, or bottom half. :-P
- Re:Ya' know... (Score:1)
  
  by Funakoshi ( 925826 ) writes:
  
  Very true. Also interesting is that 95% of men like to use statistics to seem more intelligent...
- I hope YOU know that ... (Score:1)
  
  by dazey ( 903451 ) writes:
  
  from the moment you posted that comment, the value you gave increased just a little bit more ...
Sounds useful. (Score:2, Funny)

by RandoX ( 828285 ) writes:

I can't even find my keys some days.
- Re:Sounds useful. (Score:1)
  
  by TheComputerMutt.ca ( 907022 ) writes:
  
  And how would this help you with your inability to locate them?
  
  It wouldn't, that's how.
- You ever thought [command]-[F] while... (Score:1)
  
  by atrocious cowpat ( 850512 ) writes:
  
  .. looking for stuff on your [real world] desktop?
  
  I have, have actually had my arm and fingers twitching for the keyboard...
  
  I think I need a major vacation soon, somewhere with no IT-devices whatsoever.
  
  a.c.
- Re:Sounds useful. (Score:1)
  
  by tzot ( 834456 ) writes:
  
  I can't even find my keys some days.
  
  Really? [obnoxiousfumes.com]
The Real Challenge is Further Off (Score:4, Funny)

by AthenianGadfly ( 798721 ) writes: on Wednesday December 07, 2005 @04:50PM (#14205000)

"But their method could also be applied to a broad range of applications, like discovering a new galaxy, monitoring transactions for fraud or identifying the carrier of a virulent disease among millions of people."

When asked about more advanced applications for the technology, researchers replied it will probably be "quite a while" before the technology could be used for extremely high noise environments. Said one, "I mean, it's going to be a long time before we're up to finding useful comments on Slashdot or something."

Numb3rs (Score:2, Funny)

by vanyel ( 28049 ) * writes:

Sounds like they've been watching Numb3rs ;-)
- Re:Numb3rs (Score:5, Funny)
  
  by Shadow Wrought ( 586631 ) writes: <shadow.wrought@NOSPam.gmail.com> on Wednesday December 07, 2005 @04:58PM (#14205079) Homepage Journal
  
  A favorite quote, "Physicists see equations as a reflection of reality, Engineers see reality as a reflection of equations; Mathematicians have never made the connection."
  
  - Nice quote (Score:2)
    
    by jim_deane ( 63059 ) writes:
    
    Do you have a source for that quote?
    
    It's a great quote, I'd love to be able to use it and attribute it properly.
    
    Jim
    - Re:Nice quote (Score:2)
      
      by Shadow Wrought ( 586631 ) writes:
      
      Alas no. I remember hearing it in college (ten years ago) but exactly where, when, and from whom I heard it have long since dropped out of my memory banks :-(
Now that's a change... (Score:2, Funny)

by Havenwar ( 867124 ) writes:

The Case team discovered a technique that is built on the principle of comparing a set of summary characteristics for any sub region of the observations with the background variation. From these characteristics, attempts are made to find small regions that appear significantly different from the background--a difference that cannot simply be attributed to random chance

So, basically it's the one search engine that can only find the words "horny teen nekkid" if it is NOT on a pr0n-page. I can see uses for that
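The quoted description (compare a summary of each sub-region with the background and keep the regions that stand out) resembles a sliding-window scan. The sketch below is a simplified stand-in, not the method from the paper: the function name `most_unusual_window`, the fixed window width, the known background rate, and the bin counts are all hypothetical.

```python
import math


def most_unusual_window(counts, background_rate, width):
    """Slide a fixed-width window over binned counts and score each
    position by (observed - expected) / sqrt(expected).
    Returns (start_index, score) of the highest-scoring window."""
    expected = background_rate * width
    best_start, best_score = 0, float("-inf")
    for start in range(len(counts) - width + 1):
        observed = sum(counts[start:start + width])
        score = (observed - expected) / math.sqrt(expected)
        if score > best_score:
            best_start, best_score = start, score
    return best_start, best_score


# Hypothetical histogram: roughly 4 events per bin, with a bump in bins 4 to 6.
counts = [4, 5, 3, 4, 12, 13, 11, 4, 5, 4]
start, score = most_unusual_window(counts, background_rate=4.0, width=3)
print(start, round(score, 2))
```

Because every window position is tried, the largest score has to be judged against what the maximum over many windows looks like under pure background, not against a single-test threshold.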
- - Re:Now that's a change... (Score:1)
    
    by Havenwar ( 867124 ) writes:
    
    So the first job of this marvelous search engine statistical method... is to find the one person in a bunch of perverts that would appreciate the results.
    
    jack thompson.
    
    see, didn't need a search engine. and why pray tell would he want to find the one page that said "nekkid teen sexy" and wasn't a pr0n page? Oh to press charges of course. Pages like that corrupt our young! Never mind the real pr0n, that's so.. out there. It's the one page that mentions a single naughty word that's in for the trouble!
PDF Warning? (Score:1)

by Anonymous Coward writes:

Why do we need to be warned that it's a PDF? I can understand an "MS Word Warning" but PDF is platform independent. What's wrong with PDF?
- Re:PDF Warning? (Score:2)
  
  by flood6 ( 852877 ) writes:
  
  My problem with them is that one of my work PCs is very old but still fine for browsing the internet. Clicking on a link on this machine that I did not realize was a PDF sets off a long and tedious series of about 3 minutes where FF locks up until the Acrobat Reader plugin loads, then it downloads and displays the PDF, then scrolling through the file itself is really jumpy, then I have to close it which is slow and sometimes crashes FF.
  Even on my faster PCs, reading a large PDF feels slower than it shoul
  - Re:PDF Warning? (Score:2)
    
    by bogado ( 25959 ) writes:
    
    Why don't you uninstall the plugin on this machine then? It seems to me that the plugin is useless since you're never going to want to use it anyway. Keep the reader as a stand-alone app, so you can still view the PDFs if you want to.
- - Re:PDF Warning? (Score:2)
    
    by rco3 ( 198978 ) writes:
    
    "PDF documents are not handled directly by your browser."
    
    They most certainly are. Of course, I'm not using IE or FF. In any case, that doesn't really justify the warning.
9...9...9...9... (Score:1)

by r3adah3ad ( 936993 ) writes:

"...a difference that cannot simply be attributed to random chance..." If it's random, how do you know?
- Re:9...9...9...9... (Score:2)
  
  by chill ( 34294 ) writes:
  
  Random has NO pattern whatsoever. Detecting a pattern, however small, implies non-random data. QED
  
  -Charles
  - Re:9...9...9...9... (Score:3, Insightful)
    
    by Stonehand ( 71085 ) writes:
    
    Not really.
    
    The more you constrain your allegedly random process, such as by insisting that it produce output without "patterns" -- whatever those are -- the less random it actually is.
    
    To put it in more concrete terms, which is more random -- a coin which flips 50-50 heads/tails with no other constraints whatsoever, or a coin which flips 50-50 but will never, say, flip 100 heads in a row, and will never exactly alternate, and will never produce the bit sequence corresponding to the ASCII encoding of the text
  - Re:9...9...9...9... (Score:2)
    
    by CoolVibe ( 11466 ) writes:
    
    If you have an infinite amount of random data, every pattern will be in there somewhere. At least, that's what I was led to believe.
    - Re:9...9...9...9... (Score:2)
      
      by chill ( 34294 ) writes:
      
      If you have an infinite amount of random data, every pattern will be in there somewhere. At least, that's what I was led to believe.
      
      Yes, but only if you look at smaller segments, which changes your dataset. For example, if you spot the first 30 digits of Pi in an infinitely random set, the question becomes is your random set Pi? If not, the pattern only applies to those 30 digits and thus your set changes and is no longer the infinite set of random data.
      
      And they aren't dealing with an "infinite" set, but
    - - Re:9...9...9...9... (Score:2)
        
        by CoolVibe ( 11466 ) writes:
        
        They're in there too. Infinity is tricky.
        
        If you have an infinite amount of hay, and throw in an infinite amount of needles, you'll still be spending a lot of time finding the needles. :)
- Re:9...9...9...9... (Score:5, Insightful)
  
  by flynt ( 248848 ) writes: on Wednesday December 07, 2005 @05:04PM (#14205117)
  
  Whether you "know" or not is always up for debate, but that's usually for epistemology class. In classical hypothesis testing in statistics, you make a distributional assumption about your data, and then calculate a probability from the data you observed (the p-value) given your initial assumption. If this probability is very low (also an interpretation), you assume your initial distributional assumption was incorrect. There are finer points to it of course, but classical hypothesis testing in statistics is pretty much a reductio ad absurdum in logic.
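The recipe described above (assume a distribution, compute how probable the observed data would be under it, reject the assumption if that probability is small) can be made concrete with a coin. A minimal sketch, assuming a fair-coin null hypothesis and a one-sided test; `binomial_p_value` and the 9-of-10 example are made up for illustration.

```python
from math import comb


def binomial_p_value(heads, flips, p_heads=0.5):
    """One-sided p-value: probability of seeing at least `heads` heads
    in `flips` flips if the coin really lands heads with p_heads."""
    return sum(
        comb(flips, k) * p_heads**k * (1 - p_heads) ** (flips - k)
        for k in range(heads, flips + 1)
    )


# Observed: 9 heads in 10 flips. Null assumption: the coin is fair.
p = binomial_p_value(heads=9, flips=10)
print(f"p-value = {p:.4f}")  # prints p-value = 0.0107
```

Whether 0.0107 counts as "very low" is the interpretive step the comment mentions; the conventional 5% cutoff would reject the fair-coin assumption here.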
  
Case Western Reserve University (Score:4, Interesting)

by tomzyk ( 158497 ) writes: on Wednesday December 07, 2005 @04:57PM (#14205063) Journal

FYI: Its abbreviation is not "CWRU" anymore. As of about 2 years ago, they changed it to simply "Case" and gave it the silly new logo of 2 paperclips stuck together.

Why? I have no idea. Some "university branding" thing that some people thought was important to the growth of the campus or something. Apparently it ticked off a bunch of alumni (from the original Western Reserve University) too.

Knowing is half the battle.

- Re:Case Western Reserve University (Score:2)
  
  by Manhigh ( 148034 ) writes:
  
  The name of the school is still Case Western Reserve University.
  
  Despite the fact that it's OK to officially call it 'Case' now (it wasn't OK to do so in '97), CWRU is still a valid abbreviation. Plus I paid so much money to that place that I'll call it whatever I damn well please.
  
  - '02
- Re:Case Western Reserve University (Score:2, Funny)
  
  by Anonymous Coward writes:
  
  Actually, it's not two paper clips together. It's a fat man holding a surf board. Look for yourself [case.edu]
- Re:Case Western Reserve University (Score:2)
  
  by ThosLives ( 686517 ) writes:
  
  I have to say, I'm glad that my alma mater (Case School of Engineering, 2000) is actually still doing real science. I'm kind of disappointed at all the folks above who posted about "finding useful information in the noise of internet information" though; that type of information gathering is not quite the same as discerning between special-cause and random-cause fluctuations in a signal (mostly because the Internet consists mostly of special-cause variation: i.e., things people have written or created). Dis
- Re:Case Western Reserve University (Score:2)
  
  by Dachannien ( 617929 ) writes:
  
  More on the logo. [case.edu]
  
  The true offense in the OP was calling it "Case Western". It's not a "reserve university", whatever that means.
  
  I've always just called it "Case" since I started there as an undergraduate in 1994, while my e-mail address still contains cwru.edu. Both of those are used now - "Case" just validates the fact that most people really get tired of saying the whole name over and over again.
Speaking of needle in a haystack ... (Score:5, Funny)

by airrage ( 514164 ) writes: on Wednesday December 07, 2005 @04:59PM (#14205081) Homepage Journal

Someone asked me to give ten different ways to find a needle in a haystack, these are my thoughts:

1) INDUSTRIAL MAGNET
2) BLIND LUCK
3) BURN THE HAY, PICK UP THE NEEDLE
4) STATISTICAL ANALYSIS (SINCE NEEDLES IN HAYSTACKS ARE NOT PLACED AT RANDOM, THEY ARE SUBJECT TO REGRESSION ANALYSIS)
5) OFFSHORE TO CHINA WHERE LABOR IS CHEAPER, SEARCH THE HAY WITH 10,000 WORKERS.
6) WAIT YEARS UNTIL THE HAY DECAYS, PICK UP THE NEEDLE
7) SPREAD OUT THE HAY, HIRE BAREFOOT HAY WALKERS
8) TAKE ALL THE HAY, PUT IN A POOL OF WATER - HAY WILL FLOAT, AND NEEDLE WILL SINK
9) LET COWS EAT THE HAY, X-RAY ALL THE COWS!
10) TRIAL AND ERROR - ONE PERSON

- Mythbusters (Score:3, Informative)
  
  by everphilski ( 877346 ) writes:
  
  Mythbusters actually did an ep where they built two different needle-in-haystack finding machines, one actually did quite well...
  
  -everphilski-
  - Re:Mythbusters (Score:2)
    
    by Tony Hoyle ( 11698 ) writes:
    
    Their solutions were kinda destructive though.
    
    I'd like to see a way of finding a needle in a haystack that left you with a (largely) intact haystack afterwards, not a pile of ash or a wet sludge.
    
    Huge inductive coils would be a good start... probably wouldn't find the bone one though - maybe some kind of MRI?
  - Re:Mythbusters (Score:2)
    
    by pipingguy ( 566974 ) writes:
    
    Was that the one where Kari(sp?) mugged and did things for the camera in the cutesy, girly way? Stupid me, that pretty much describes every episode she's been in.
    
    Scotty was a no-bullshit welder (and very attractive, to boot). Bring her back, she's a *real* babe.
- Re:1) INDUSTRIAL MAGNET (Score:2, Funny)
  
  by Anne_Nonymous ( 313852 ) writes:
  
  1) INDUSTRIAL MAGNET
  
  DBAs everywhere are cringing and covering their data.
- Re:Speaking of needle in a haystack ... (Score:2)
  
  by pherthyl ( 445706 ) writes:
  
  I've got another one..
  
  11) LET COWS EAT THE HAY, DISSECT DEAD COW
  
  lameness filter blah
  - Re:Speaking of needle in a haystack ... (Score:2)
    
    by iamlucky13 ( 795185 ) writes:
    
    Ok, time for another one of iamlucky13's little-known redneck nerd facts
    
    Category: Cattle
    Entry: 1097
    Ranchers will commonly intentionally force feed a smooth magnet [magnetsource.com] to calves. Because of its weight, it will remain in the rumen or reticulum (the 1st and 2nd stomach compartments, respectively) for the life of the cow. Fields often have stray bits of metal small enough to be accidentally ingested while grazing, such as barb wire bits, fence staples, screws, etc. When stuck on the magnet, the pieces are eff
- - Re:Speaking of needle in a haystack ... (Score:2)
    
    by Lehk228 ( 705449 ) writes:
    
    that only works when you gently place the needle on the surface. when you plop it in with an assload of hay it will sink.
    
    the problem is you have to churn the hay so the needle won't get stuck to the floating hay.
To find a signal in a sea of noise... (Score:1)

by San Francisco ( 936406 ) writes:

Perhaps this technology can make Usenet useful once again.
Maybe Slashdot can use it to find dupes (Score:1)

by kk49 ( 829669 ) writes:

End
Of
Message
- Maybe Slashdot can use it to find dupes (Score:1)
  
  by Bloggins ( 783115 ) writes:
  
  End of Message
- Re:Maybe Slashdot can use it to find dupes (Score:1)
  
  by UMEngin ( 895769 ) writes:
  
  That would be like finding the hay in the haystack.
- Got your ratio reversed (Score:2)
  
  by quanticle ( 843097 ) writes:
  
  In Slashdot, the dupe to original article ratio is so high, it's the original articles that need finding, not the dupes. Funny, though, from what I've seen, it seems like this particular algorithm would be quite efficient in doing that (e.g. it specializes in finding the data that is different, versus categorizing existing data).
SETI? (Score:2)

by ruiner13 ( 527499 ) writes:

Would this be useful to reduce the computations needed for the SETI@Home folks too? Seems they have a bit of data to sort through... Hell, genetic engineering too. Look for useful patterns in hundreds of DNA strands.
Mythbusters did this... (Score:2)

by slashname3 ( 739398 ) writes:

Mythbusters did this one already. They built two machines/processes to find needles in haystacks. One used a process to burn away the hay leaving the needles and the other used magnets and gravity to separate the needles from the hay.

Oh, wait. They're talking about data. Never mind.
We are at the horizon of a cultural singularity... (Score:1)

by Errandboy of Doom ( 917941 ) writes:

THE SINGULARITY

Throughout history, we championed the content creator. Only a tiny fraction of the population could write or understand math or science. Only a tiny fraction could dedicate themselves to the arts.

Most individuals' time was consumed by being agrarian generalists: they owned a farm, and they were constantly occupied by all the repairs and maintenance of their property. It wasn't a job, it was a way of life. But now, more and more, our economy makes us all incredible specialists. We're conf
- Re:We are at the horizon of a cultural singularity (Score:2)
  
  by fishybell ( 516991 ) writes:
  
  shut your pie hole
Significant % of patterns in randomness (Score:3, Informative)

by G4from128k ( 686170 ) writes: on Wednesday December 07, 2005 @05:32PM (#14205350)

Looking for possible patterns in large volumes of data is dangerous because of the high chance that random data will fit some of the myriad patterns tried. If you test data against a thousand possible patterns, then about 50 of them will be found to be present at a statistical significance level of 5% (even if the data is 100% random). "Cancer clusters" are an excellent example of this -- if you slice and dice a population enough different ways you are bound to find some geographic/demographic/ethnographic subgroup with a very high chance of some cancer.

It's better to have an a priori hypothesis and look for one specific, pre-defined pattern in a mound of data than to see if any pattern is in the data. Or, if one insists on looking for many patterns, then the standards for statistical significance must be correspondingly higher.
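The arithmetic behind the comment above is short enough to write out. The sketch assumes the tests are independent and uses the Bonferroni correction as one standard way to raise the bar when many patterns are tried; the function names are hypothetical and the numbers are the ones from the comment (1,000 patterns, 5% level).

```python
def expected_false_positives(num_tests, alpha):
    """Average number of tests that pass by chance on pure noise."""
    return num_tests * alpha


def familywise_error_rate(num_tests, alpha):
    """Chance of at least one false positive among independent tests."""
    return 1.0 - (1.0 - alpha) ** num_tests


def bonferroni_threshold(num_tests, alpha):
    """Per-test significance level that keeps the overall rate near alpha."""
    return alpha / num_tests


num_tests, alpha = 1000, 0.05
print(expected_false_positives(num_tests, alpha))
print(familywise_error_rate(num_tests, alpha))
print(bonferroni_threshold(num_tests, alpha))
```

With 1,000 patterns at the 5% level, about 50 spurious hits are expected and at least one is all but certain; the Bonferroni threshold requires each individual test to reach p < 0.00005 instead.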

- Re:Significant % of patterns in randomness (Score:5, Informative)
  
  by zex ( 214881 ) writes: on Wednesday December 07, 2005 @05:54PM (#14205567) Homepage
  
  If you test data against a thousand possible patterns, then about 50 of them will be found to be present at a statistical level of 5% (even if the data is 100% random).
  
  If you're not correcting for multiple hypothesis testing, you are correct. If you do have 100% random data that holds to perfect randomness at all scales (which I'm not sure is even possible) and correct for multiple hypothesis testing, then you'll find exactly what you "should" find: no significant pattern.
  
  You mention "Cancer clusters" as an example of attribution of significance to insignificant findings. However, these clusters are often found (at least in the genetics research realm) by hierarchical clustering, which is self-correcting for multiple hypothesis testing. If you're speaking of demographic surveys which find that (e.g.) "black females in Tahiti who were exposed to .... are more susceptible to brain cancer", then you're probably right. I too see those as examples of restricting the domain of samples until you find a pattern - but the pattern nonetheless exists.
  
- Re:Significant % of patterns in randomness (Score:2)
  
  by hackstraw ( 262471 ) * writes:
  
  Looking for possible patterns in large volumes of data is dangerous because of the high chance that random data will fit some of the myriad patterns tried.
  
  No, God put the figure of Jesus in the sky, but made it not look too much like Jesus just to test the difference between the believers and non-believers. Trust me, it was not easy to do all that with nobody looking.
SETI? (Score:2)

by Nom du Keyboard ( 633989 ) writes:

SETI?
Regarding fraudulent transactions... (Score:2, Interesting)

by ahmusch ( 777177 ) writes:

Current fraud detection systems in use in the financial industry are based on two primary knowledge bases:

1. A knowledge of your purchasing pattern as a consumer. To wit, having a statistically significant sample of what are valid transactions as well as knowing your credit score and income.

Do you shop at high-end stores? Do you use your card for primarily travel and entertainment? Do you use your card for everyday purchases? How much of your line-of-credit do you tend to use?

2. A comparison of recent
Hey, wait a minute! (Score:4, Funny)

by $RANDOMLUSER ( 804576 ) writes: on Wednesday December 07, 2005 @05:40PM (#14205434)

An article posted by Roland Piquepaille with no links back to his site???
WTF? Roland? You feeling OK?

- Re:Hey, wait a minute! (Score:2)
  
  by Dachannien ( 617929 ) writes:
  
  His name links to his website, so he still gets the pagerank boost. Beatles-Beatles does the same thing, and ScuttleMonkey the Sock Puppet posts his stories, too.
Relationship to other information theory concepts? (Score:2)

by Chilltowner ( 647305 ) writes:

Sort of a dilettante question, but I've been researching using entropy and information gain here at work and some of what they're talking about in the article and the paper seems familiar, though I'm not skilled enough in stats yet to make much out of it. It seems to me to be fairly similar to how you get an information gain score. If you can classify the background as such, you should be able to sift through data with however many parameters you want and find the parameters that cause the greatest diffe
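For readers unfamiliar with the term, information gain is the drop in label entropy after splitting the data on a feature. The sketch below only illustrates that definition; whether it maps onto the paper's method is the commenter's conjecture, and the function names and the tiny signal/background data set are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(labels, feature_values):
    """Entropy of the labels minus the weighted entropy that remains
    after grouping the labels by the value of one feature."""
    total = len(labels)
    groups = {}
    for label, value in zip(labels, feature_values):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


labels = ["signal", "signal", "background", "background"]
separating = ["high", "high", "low", "low"]   # splits the classes cleanly
uninformative = ["a", "b", "a", "b"]          # tells us nothing

print(information_gain(labels, separating))
print(information_gain(labels, uninformative))
```

A feature that separates signal from background perfectly recovers the full one bit of label entropy, while an uninformative one recovers none.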
As a particle physicist (Score:5, Interesting)

by Lord Byron II ( 671689 ) writes: on Wednesday December 07, 2005 @06:48PM (#14205962)

As a particle physicist I know exactly the kind of challenge that this is. The SNR is horrible, you've got tons of data, and the data may be distorted by all sorts of sources (background, misalignment, the wrong reaction, etc).
I also know that these sorts of algorithms are created all of the time. In fact, someone in my lab got his Ph.D. for applying a neural network to this problem. Furthermore, these algorithms are not "plug-n-play". They must be manually adjusted by a team with deep knowledge of the system in order to be useful.
So trust me when I say that Roland has blown this out of proportion. Congratulations to the CWRU team for getting the PRL paper published, but this is hardly the kind of ground-breaking news that deserves to be on Slashdot.

- - Re:As a nervous system. (Score:3, Funny)
    
    by mako1138 ( 837520 ) writes:
    
    Here's a couple TB of data. Find me all the top quark candidates by tomorrow.
Yarbles (Score:2)

by Ranger ( 1783 ) writes:

I often have the same feeling about Slashdot. It's like a big haystack, but the needles are larger and easier to find. I have noticed that the Roland Piquepaille needles happen to be the most worthless. The obvious solution for finding the proverbial needle in the haystack of data is to make it up. It's not like there are any real-world [blueyonder.co.uk] examples. [thenation.com]
Develop, not Discover (Score:2)

by John Newman ( 444192 ) writes:

From the title of TFA, "Case researchers discover methods to find 'needles in haystack' in data". Pet peeve of mine, new techniques are not "discovered", they are "developed" (or something similar). Henry Ford did not discover the Model T by peering through a microscope, and CowboyNeal did not discover SlashCode by analyzing reams of code observations. It may be semantic nit-picking, but I think saying that the researchers just discovered this (surely insanely complex) bit of mathematical analysis takes away
What I perceive as a bigger problem... (Score:2)

by Ogemaniac ( 841129 ) writes:

is the overwhelming size of the literature. It is getting harder and harder to find the information that you need among a sea of near misses. Even to stay on top of one's subfield would require reading at least five journal papers a day, which is a significant undertaking even before you have to spend large amounts of time hunting for papers. For example, I am a chemist. It is generally not too difficult to find papers about a specific molecule - each molecule is assigned a specific ID number, which can
I don't want to rain on the parade (Score:4, Interesting)

by martin-boundary ( 547041 ) writes: on Wednesday December 07, 2005 @09:27PM (#14206837)

I don't want to rain on the parade, but the result is quite possibly wrong.
If you download the linked paper, on the second page they talk about the Breit-Wigner (Cauchy) density psi, and later they claim that their score process has zero expectation. Now, everyone knows that the Breit-Wigner does not *have* an expectation, and it's often used as an example where the asymptotic normal (Gaussian) distribution approximation doesn't hold. But still, they derive all sorts of distribution formulas involving a chi squared and a Gaussian process, as if there was no problem at all with the Breit-Wigner tails.
I think their derivation is quite possibly wrong.
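A quick way to see the heavy-tail issue raised above is to simulate it. The sketch below is not taken from the paper. It draws standard Cauchy (Breit-Wigner) samples by inverse-CDF sampling and prints the running sample mean next to the sample median. The class name `CauchyMeanDemo`, its helper methods, and the seed are illustrative assumptions.

```java
import java.util.Arrays;
import java.util.Random;

final class CauchyMeanDemo {
    // Inverse CDF of the standard Cauchy: if U is uniform on (0,1),
    // tan(pi * (U - 0.5)) is standard Cauchy distributed.
    static double quantile(double u) {
        return Math.tan(Math.PI * (u - 0.5));
    }

    // Draw n standard Cauchy samples from a seeded generator (reproducible).
    static double[] sample(int n, long seed) {
        Random rng = new Random(seed);
        double[] xs = new double[n];
        for (int i = 0; i < n; i++) {
            xs[i] = quantile(rng.nextDouble());
        }
        return xs;
    }

    // Arithmetic mean of the first `count` samples.
    static double mean(double[] xs, int count) {
        double sum = 0.0;
        for (int i = 0; i < count; i++) {
            sum += xs[i];
        }
        return sum / count;
    }

    // Median of the first `count` samples (sorts a copy, input is untouched).
    static double median(double[] xs, int count) {
        double[] copy = Arrays.copyOf(xs, count);
        Arrays.sort(copy);
        int mid = count / 2;
        return (count % 2 == 1) ? copy[mid] : 0.5 * (copy[mid - 1] + copy[mid]);
    }

    public static void main(String[] args) {
        double[] xs = sample(1_000_000, 42L); // seed chosen arbitrarily
        for (int n = 100; n <= xs.length; n *= 10) {
            System.out.printf("n=%d mean=%.4f median=%.4f%n",
                    n, mean(xs, n), median(xs, n));
        }
    }
}
```

The median should settle near the location parameter 0 as n grows. The mean has no such guarantee: the average of n independent standard Cauchy draws is itself standard Cauchy, so averaging more data does not narrow its spread.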

- Re:I don't want to rain on the parade (Score:2)
  
  by burns210 ( 572621 ) writes:
  
  "Now, everyone knows that..."
  You keep using that word. I do not think it means what you think it means.
- Re:I don't want to rain on the parade (Score:2, Insightful)
  
  by jmtpi ( 17834 ) writes:
  
  martin-boundary wrote:
  If you download the linked paper, on the second page they talk about the Breit-Wigner (Cauchy) density psi, and later they claim that their score process has zero expectation. Now, everyone knows that the Breit-Wigner does not *have* an expectation, and it's often used as an example where the asymptotic normal (Gaussian) distribution approximation doesn't hold. But still, they derive all sorts of distribution formulas involving a chi squared and a Gaussian process, as if there was no
- - Re:I don't want to rain on the parade (Score:3, Insightful)
    
    by martin-boundary ( 547041 ) writes:
    
    That's a good point. In the paper, the formula (2) is finite only if the tails of f dominate the tails of psi, so that means that f would have to be at least as fat tailed as the Cauchy. However, the paper doesn't attempt to state any assumptions, so it's hard to see which parts are solid and where there might be handwaving.
    Funnily enough, the density f they use in the monte carlo simulation appears to be truncated to be in the interval [0,2] (otherwise it wouldn't be integrable). That suggests that in pr
Needle/Haystack (Score:2)

by pdjohe ( 575876 ) writes:

Oh come on! It's not that hard!

public static Object find(Object needle, Object[] haystack) {
    for (int i = 0; i < haystack.length; i++)
        if (haystack[i].equals(needle))
            return needle;
    return null;
}
Able Danger (Score:2)

by technoCon ( 18339 ) writes:

There are disputed reports that this sort of data mining was used to identify the terrorists who attacked the USS Cole and flew airplanes into the World Trade Center (the official 9/11 commission's findings notwithstanding). The project, called "Able Danger," is well documented on the right side of the web. According to rumor, the project was shut down after it identified Mohammed Atta but also pointed to Condoleezza Rice and Hillary Clinton as potential foreign spies.

This raises the issue of false al
- Obligatory (Score:2)
  
  by drewzhrodague ( 606182 ) writes:
  
  "What does god want with a starship?" -Spock
  - Re:Obligatory (Score:1)
    
    by RetroGeek ( 206522 ) writes:
    
    Kirk said this, not Spock
    - Re:Obligatory (Score:2)
      
      by drewzhrodague ( 606182 ) writes:
      
      Oops! And here I thought I was Spock -- er, spot-on. =_)
    - Re:Obligatory (Score:1)
      
      by RadioD00d ( 714469 ) writes:
      
      Nope - it was McCoy
      - Re:Obligatory (Score:2)
        
        by fiannaFailMan ( 702447 ) writes:
        
        Kirk said it, then McCoy asked him what he was doing and said "you don't ask the almighty for his ID!"
- ITagging (Score:1, Funny)
  
  by Anonymous Coward writes:
  
  "What does God use to tag a galaxy with though?"
  
  Are you telling us there's such a thing as Intelligent Tagging?
- Re:Why is technology starting to... (Score:1)
  
  by sonofagunn ( 659927 ) writes:
  
  It means you're getting old!
- Re:But will it help me... (Score:2)
  
  by stinerman ( 812158 ) writes:
  
  I know the warranty will be void if you shave off the pubic hair yourself (intentional damage to the product), but you might want to try it anyway. Buy the hairless variety next time and you should be in good shape.
- Re:Roland Alert (Score:1, Funny)
  
  by Anonymous Coward writes:
  
  Where have you been?
  Don't you know the editors are in cahoots with the Beatles Beatles guy now?
  Please, try to keep up with the conspiracy theories, mkay? Jeez!
- Re:Monte Carlo experimental results? (Score:2)
  
  by jim_deane ( 63059 ) writes:
  
  If it isn't monte carlo, it's the "random" (drunkard's) walk.
  
  You know, you GO into computational physics thinking it's all casinos and drinking, and this is what you find...
