Finding a Needle in a Haystack of Data
Roland Piquepaille writes "Finding useful information in oceans of data is an increasingly complex problem in many scientific areas. This is why researchers from Case Western Reserve University (CWRU) have created new statistical techniques to isolate useful signals buried in large datasets coming from particle physics experiments, such as the ones run in a particle collider. But their method could also be applied to a broad range of applications, like discovering a new galaxy, monitoring transactions for fraud or identifying the carrier of a virulent disease among millions of people." Case Western has also provided a link to the original paper. [PDF Warning]
Google (Score:4, Interesting)
If it does, it sure can save these researchers a lot of time; If it doesn't, I'm sure Google will be keen to get involved, especially on the "isolate useful signals buried in large datasets" part.
Re:Google (Score:1)
Re:Google (Score:3, Funny)
Wow. There really aren't any out there. Check it out on Google [google.com] yourselves.
The same results come back in images, groups, news, etc. Man. What a sad bunch.
Re:Google (Score:2, Funny)
Re:Google (Score:2)
Give it a few months and you will be surprised!
(Of course, get yourself in shape; it's not too hard. An hour, 5 days out of the week, on a cycle or treadmill will do the trick. Lay off the sugary and fatty snacks.)
I've even had propositions! THEY came to ME!
Re:Google (Score:3, Funny)
It's only in Beta thus it's not useful
Was it just me or was this story broken at first? (Score:1)
Re:Was it just me or was this story broken at firs (Score:4, Funny)
Maybe your interest in the story was deemed statistically insignificant.
Re:Was it just me or was this story broken at firs (Score:1)
Indexes (Score:1)
Re:Indexes (Score:2, Informative)
Re:Indexes (Score:2)
Re:Indexes (Score:2)
It would be more useful to transform the apparently random data in some way so as to make signals or discrepancies buried in it obvious. There are all kinds of fun
Re:Indexes (Score:1)
The most obvious application (Score:5, Interesting)
I see this as being a boon to SETI [berkeley.edu]. If there was ever a needle in a haystack, it's trying to tease a possible intelligent signal out of the cosmic background noise. If you have an idea what the background is like in general, then it's far easier to detect an abnormality in that background noise. The question will end up being, are we simply detecting more false positives or are these real signals?
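A minimal sketch of the "know the background, flag the abnormality" idea, assuming the background can be summarized by a mean and standard deviation estimated from signal-free samples. The function name, the sample values, and the 5-sigma default are illustrative assumptions, not anything taken from SETI or from the paper:

```python
from statistics import mean, stdev

def flag_outliers(background, samples, n_sigma=5.0):
    """Return indices of samples lying more than n_sigma standard
    deviations above the mean of the background measurements."""
    mu = mean(background)
    sigma = stdev(background)  # sample standard deviation of the background
    threshold = mu + n_sigma * sigma
    return [i for i, x in enumerate(samples) if x > threshold]

# Hypothetical background-only measurements and new observations.
background = [1.0, 1.2, 0.9, 1.1, 0.8, 1.0, 1.1, 0.9]
samples = [1.0, 1.3, 4.0, 0.9]

print(flag_outliers(background, samples))               # strict threshold
print(flag_outliers(background, samples, n_sigma=2.0))  # looser threshold
```

Lowering `n_sigma` flags more candidates, which is exactly the false-positive trade-off the comment asks about.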
Seti (Score:2)
Re:The most obvious application (way OT) (Score:2)
($world = $world) =~ s/bad/good/g;
otherwise you're making your world better but not ever doing anything with it...
Re:The most obvious application (way OT) (Score:2)
Re:The most obvious application (Score:2)
I'll read the article tonight and find out if it's applicable and whether it's better than what we are using. In the SETI@home client processing we already take into account the anticipated form of the signal, so I'm not sure this buys us anything. In fact, other than the exact mathematical description as a multidimensional manifold the text makes it appear that we're already using this technique in our searches for repeated pulses and signals matching the Gaussian pr
Re:The most obvious application (Score:2)
Hmm, well, how would you go about testing the hypothesis that there's no life out there? Or that there is life out there?
Well, you could look into space!!!!!
Gee, wouldn't it be good if someone was doing that to provide the evidence one way or another?
I've got no problem with Seti. I think they're going to fail to detect intelligent life, but as long as their ai
Ya' know... (Score:3, Funny)
Re:Ya' know... (Score:2, Funny)
Well yeah, 50% of all statisticians finished in the bottom half of their class.
Re:Ya' know... (Score:2)
E.g., if there are 100 statisticians, the mean score is 37, and 10 statisticians scored that, only 45% of statisticians are technically in the bottom half (and 45% in the top half). 10% are exactly in the middle.
You could say that the 10% are in both the bottom and top half... in which case 55% are in the bottom half and 55% are in the top half!!
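The arithmetic above can be checked with a short sketch. It assumes 37 is also the median (the halves of a class are split by the median, not the mean), and the other scores (20 and 60) are made up purely for illustration:

```python
from statistics import median

def split_around_median(scores):
    """Count how many scores fall strictly below, exactly at,
    and strictly above the median."""
    m = median(scores)
    below = sum(1 for s in scores if s < m)
    at = sum(1 for s in scores if s == m)
    above = sum(1 for s in scores if s > m)
    return below, at, above

# Hypothetical class: 45 low scores, 10 tied at 37, 45 high scores.
scores = [20] * 45 + [37] * 10 + [60] * 45
below, at, above = split_around_median(scores)
print(below, at, above)        # 45 10 45
print(below + at, above + at)  # 55 55 if the ties count in both halves
```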
Re:Ya' know... (Score:3, Funny)
Re:Ya' know... (Score:2)
I'm not sure now if a comment like that puts you in the top half, or bottom half.
Re:Ya' know... (Score:1)
I hope YOU know that ... (Score:1)
Sounds useful. (Score:2, Funny)
Re:Sounds useful. (Score:1)
It wouldn't, that's how.
You ever thought [command]-[F] while... (Score:1)
I have, have actually had my arm and fingers twitching for the keyboard...
I think I need a major vacation soon, somewhere with no IT-devices whatsoever.
a.c.
Re:Sounds useful. (Score:1)
The Real Challenge is Further Off (Score:4, Funny)
"But their method could also be applied to a broad range of applications, like discovering a new galaxy, monitoring transactions for fraud or identifying the carrier of a virulent disease among millions of people."
When asked about more advanced applications for the technology, researchers replied it will probably be "quite a while" before the technology could be used for extremely high noise environments. Said one, "I mean, it's going to be a long time before we're up to finding useful comments on Slashdot or something."
Numb3rs (Score:2, Funny)
Re:Numb3rs (Score:5, Funny)
Nice quote (Score:2)
It's a great quote, I'd love to be able to use it and attribute it properly.
Jim
Re:Nice quote (Score:2)
Now that's a change... (Score:2, Funny)
So, basically it's the one search engine that can only find the words "horny teen nekkid" if it is NOT on a pr0n-page. I can see uses for that.
Re:Now that's a change... (Score:1)
Jack Thompson.
See, didn't need a search engine. And why, pray tell, would he want to find the one page that said "nekkid teen sexy" and wasn't a pr0n page? Oh, to press charges, of course. Pages like that corrupt our young! Never mind the real pr0n, that's so... out there. It's the one page that mentions a single naughty word that's in for the trouble!
PDF Warning? (Score:1)
Re:PDF Warning? (Score:2)
Even on my faster PCs, reading a large PDF feels slower than it shoul
Re:PDF Warning? (Score:2)
Re:PDF Warning? (Score:2)
They most certainly are. Of course, I'm not using IE or FF. In any case, that doesn't really justify the warning.
9...9...9...9... (Score:1)
Re:9...9...9...9... (Score:2)
-Charles
Re:9...9...9...9... (Score:3, Insightful)
The more you constrain your allegedly random process, such as by insisting that it produce output without "patterns" -- whatever those are -- the less random it actually is.
To put it in more concrete terms, which is more random -- a coin which flips 50-50 heads/tails with no other constraints whatsoever, or a coin which flips 50-50 but will never, say, flip 100 heads in a row, and will never exactly alternate, and will never produce the bit sequence corresponding to the ASCII encoding of the text
Re:9...9...9...9... (Score:2)
Re:9...9...9...9... (Score:2)
Yes, but only if you look at smaller segments, which changes your dataset. For example, if you spot the first 30 digits of Pi in an infinitely random set, the question becomes is your random set Pi? If not, the pattern only applies to those 30 digits and thus your set changes and is no longer the infinite set of random data.
And they aren't dealing with an "infinite" set, but
Re:9...9...9...9... (Score:2)
If you have an infinite amount of hay, and throw in an infinite amount of needles, you'll still be spending a lot of time finding the needles.
Re:9...9...9...9... (Score:5, Insightful)
Case Western Reserve University (Score:4, Interesting)
Why? I have no idea. Some "university branding" thing that some people thought was important to the growth of the campus or something. Apparently it ticked off a bunch of alumni (from the original Western Reserve University) too.
Knowing is half the battle.
Re:Case Western Reserve University (Score:2)
Despite the fact that it's OK to officially call it 'Case' now (it wasn't OK to do so in '97), CWRU is still a valid abbreviation. Plus I paid so much money to that place that I'll call it whatever I damn well please.
- '02
Re:Case Western Reserve University (Score:2, Funny)
Re:Case Western Reserve University (Score:2)
Re:Case Western Reserve University (Score:2)
The true offense in the OP was calling it "Case Western". It's not a "reserve university", whatever that means.
I've always just called it "Case" since I started there as an undergraduate in 1994, while my e-mail address still contains cwru.edu. Both of those are used now - "Case" just validates the fact that most people really get tired of saying the whole name over and over again.
Speaking of needle in a haystack ... (Score:5, Funny)
1) INDUSTRIAL MAGNET
2) BLIND LUCK
3) BURN THE HAY, PICK UP THE NEEDLE
4) STATISTICAL ANALYSIS (SINCE NEEDLES IN HAYSTACKS ARE NOT PLACED AT RANDOM, THEY ARE SUBJECT TO REGRESSION ANALYSIS)
5) OFFSHORE TO CHINA WHERE LABOR IS CHEAPER, SEARCH THE HAY WITH 10,000 WORKERS.
6) WAIT YEARS UNTIL THE HAY DECAYS, PICK UP THE NEEDLE
7) SPREAD OUT THE HAY, HIRE BAREFOOT HAY WALKERS
8) TAKE ALL THE HAY, PUT IN A POOL OF WATER - HAY WILL FLOAT, AND NEEDLE WILL SINK
9) LET COWS EAT THE HAY, X-RAY ALL THE COWS!
10) TRIAL AND ERROR - ONE PERSON
Mythbusters (Score:3, Informative)
-everphilski-
Re:Mythbusters (Score:2)
I'd like to see a way of finding a needle in a haystack that left you with a (largely) intact haystack afterwards, not a pile of ash or a wet sludge.
Huge inductive coils would be a good start... probably wouldn't find the bone one though - maybe some kind of MRI?
Re:Mythbusters (Score:2)
Was that the one where Kari(sp?) mugged and did things for the camera in the cutesy, girly way? Stupid me, that pretty much describes every episode she's been in.
Scotty was a no-bullshit welder (and very attractive, to boot). Bring her back, she's a *real* babe.
Re:1) INDUSTRIAL MAGNET (Score:2, Funny)
DBAs everywhere are cringing and covering their data.
Re:Speaking of needle in a haystack ... (Score:2)
11) LET COWS EAT THE HAY, DISSECT DEAD COW
lameness filter blah
Re:Speaking of needle in a haystack ... (Score:2)
Category: Cattle
Entry: 1097
Ranchers will commonly intentionally force feed a smooth magnet [magnetsource.com] to calves. Because of its weight, it will remain in the rumen or reticulum (the 1st and 2nd stomach compartments, respectively) for the life of the cow. Fields often have stray bits of metal small enough to be accidentally ingested while grazing, such as barb wire bits, fence staples, screws, etc. When stuck on the magnet, the pieces are eff
Re:Speaking of needle in a haystack ... (Score:2)
The problem is you have to churn the hay so the needle won't get stuck to the floating hay.
To find a signal in a sea of noise... (Score:1)
Maybe Slashdot can use it to find dupes (Score:1)
Of
Message
Maybe Slashdot can use it to find dupes (Score:1)
Re:Maybe Slashdot can use it to find dupes (Score:1)
Got your ratio reversed (Score:2)
On Slashdot, the dupe-to-original-article ratio is so high, it's the original articles that need finding, not the dupes. Funny, though, from what I've seen, it seems like this particular algorithm would be quite efficient in doing that (e.g. it specializes in finding the data that is different, versus categorizing existing data).
SETI? (Score:2)
Mythbusters did this... (Score:2)
Oh, wait. They're talking about data. Never mind.
We are at the horizon of a cultural singularity... (Score:1)
Throughout history, we championed the content creator. Only a tiny fraction of the population could write or understood math or science. Only a tiny fraction could dedicate themselves to the arts.
Most individuals' time was consumed by being agrarian generalists: they owned a farm, and they were constantly occupied by all the repairs and maintenance of their property. It wasn't a job, it was a way of life. But now, more and more, our economy makes us all incredible specialists. We're conf
Re:We are at the horizon of a cultural singularity (Score:2)
Significant % of patterns in randomness (Score:3, Informative)
It's better to have an a priori hypothesis and look for one specific, predefined pattern in a mound of data than to see whether any pattern at all is in the data. Or, if one insists on looking for many patterns, then the standards for statistical significance must be correspondingly higher.
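To put rough numbers on "correspondingly higher": a sketch assuming the tests are independent and the data is pure noise. The function names are hypothetical, and the Bonferroni correction shown is just one standard way to raise the bar, not necessarily what the paper uses:

```python
def familywise_error_rate(alpha, num_tests):
    """Chance of at least one false positive when running num_tests
    independent tests on pure noise, each at significance level alpha."""
    return 1.0 - (1.0 - alpha) ** num_tests

def bonferroni_threshold(alpha, num_tests):
    """Per-test significance level that keeps the overall
    false-positive rate at or below alpha (Bonferroni correction)."""
    return alpha / num_tests

alpha = 0.05
for m in (1, 20, 100):
    uncorrected = familywise_error_rate(alpha, m)
    per_test = bonferroni_threshold(alpha, m)
    corrected = familywise_error_rate(per_test, m)
    print(f"{m:4d} patterns: uncorrected {uncorrected:.3f}, "
          f"per-test level {per_test:.5f}, corrected {corrected:.3f}")
```

With 100 patterns tested at the usual 0.05 level, a "significant" hit in random data is nearly guaranteed; dividing the level by the number of tests brings the overall rate back under 0.05.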
Re:Significant % of patterns in randomness (Score:5, Informative)
If you're not correcting for multiple hypothesis testing, you are correct. If you do have 100% random data that holds to perfect randomness at all scales (which I'm not sure is even possible) and correct for multiple hypothesis testing, then you'll find exactly what you "should" find: no significant pattern.
You mention "Cancer clusters" as an example of attribution of significance to insignificant findings. However, these clusters are often found (at least in the genetics research realm) by hierarchical clustering, which is self-correcting for multiple hypothesis testing. If you're speaking of demographic surveys which find that (e.g.) "black females in Tahiti who were exposed to
Re:Significant % of patterns in randomness (Score:2)
No, God put the figure of Jesus in the sky, but made it not look too much like Jesus just to test the difference between the believers and non-believers. Trust me, it was not easy to do all that with nobody looking.
SETI? (Score:2)
Regarding fraudulent transactions... (Score:2, Interesting)
1. A knowledge of your purchasing pattern as a consumer. To wit, having a statistically significant sample of what are valid transactions as well as knowing your credit score and income.
Do you shop at high-end stores? Do you use your card for primarily travel and entertainment? Do you use your card for everyday purchases? How much of your line-of-credit do you tend to use?
2. A comparison of recent
Hey, wait a minute! (Score:4, Funny)
WTF? Roland? You feeling OK?
Re:Hey, wait a minute! (Score:2)
Relationship to other information theory concepts? (Score:2)
As a particle physicist (Score:5, Interesting)
I also know that these sorts of algorithms are created all of the time. In fact, someone in my lab got his Ph.D. for applying a neural network to this problem. Furthermore, these algorithms are not "plug-n-play". To be useful, they must be manually adjusted by a team with deep knowledge of the system.
So trust me when I say that Roland has blown this out of proportion. Congratulations to the CWRU team for getting the PRL paper published, but this is hardly the kind of ground-breaking news that deserves to be on Slashdot.
Re:As a nervous system. (Score:3, Funny)
Yarbles (Score:2)
Develop, not Discover (Score:2)
What I perceive as a bigger problem... (Score:2)
I don't want to rain on the parade (Score:4, Interesting)
If you download the linked paper, on the second page they talk about the Breit-Wigner (Cauchy) density psi, and later they claim that their score process has zero expectation. Now, everyone knows that the Breit-Wigner does not *have* an expectation, and it's often used as an example where the asymptotic normal (Gaussian) distribution approximation doesn't hold. But still, they derive all sorts of distribution formulas involving a chi squared and a Gaussian process, as if there was no problem at all with the Breit-Wigner tails.
I think their derivation is quite possibly wrong.
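For anyone who hasn't seen why the Breit-Wigner (Cauchy) has no expectation, here is a small sketch using the standard Cauchy density 1/(pi(1+x^2)). The function names are made up for illustration. It only demonstrates the textbook fact about the distribution itself and says nothing either way about the paper's score process:

```python
import math

def cauchy_pdf(x):
    """Standard Cauchy (Breit-Wigner) density."""
    return 1.0 / (math.pi * (1.0 + x * x))

def partial_first_moment(upper):
    """Closed form of the integral of x * cauchy_pdf(x) over [0, upper]:
    ln(1 + upper^2) / (2 * pi)."""
    return math.log(1.0 + upper * upper) / (2.0 * math.pi)

def partial_first_moment_numeric(upper, steps=200_000):
    """Midpoint-rule check of the same integral."""
    h = upper / steps
    total = 0.0
    for i in range(steps):
        x = (i + 0.5) * h
        total += x * cauchy_pdf(x)
    return total * h

# The one-sided integral keeps growing like ln(upper), so it never settles.
for upper in (10.0, 1e3, 1e6):
    print(upper, partial_first_moment(upper))
```

Because each tail contributes an integral that grows without bound, the mean is an "infinity minus infinity" and is undefined, which is why sample averages of Cauchy draws never settle down.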
Re:I don't want to rain on the parade (Score:2)
You keep using that word. I do not think it means what you think it means.
Re:I don't want to rain on the parade (Score:2, Insightful)
Re:I don't want to rain on the parade (Score:3, Insightful)
Funnily enough, the density f they use in the Monte Carlo simulation appears to be truncated to be in the interval [0,2] (otherwise it wouldn't be integrable). That suggests that in pr
Needle/Haystack (Score:2)
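// Linear scan: returns the first element equal to needle, or null if none matches.
public static Object find(Object needle, Object[] haystack) {
    for (int i = 0; i < haystack.length; i++) {
        // Objects.equals is null-safe, so null entries in the haystack don't throw.
        if (java.util.Objects.equals(haystack[i], needle)) {
            return haystack[i];
        }
    }
    return null;
}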
Able Danger (Score:2)
This raises the issue of false al
Obligatory (Score:2)
Re:Obligatory (Score:1)
Re:Obligatory (Score:2)
Re:Obligatory (Score:1)
Re:Obligatory (Score:2)
ITagging (Score:1, Funny)
Are you telling us there's such a thing as Intelligent Tagging?
Re:Why is technology starting to... (Score:1)
Re:But will it help me... (Score:2)
Re:Roland Alert (Score:1, Funny)
Don't you know the editors are in cahoots with the Beatles Beatles guy now?
Please, try to keep up with the conspiracy theories, mkay? Jeez!
Re:Monte Carlo experimental results? (Score:2)
You know, you GO into computational physics thinking it's all casinos and drinking, and this is what you find...