Slashdot
Text Mining the Multiverse
Posted by
michael
on Fri Oct 17, 2003 03:41 PM
from the mother-lode dept.
The NYT has a decent piece about text-mining, skimming large volumes of miscellaneous text to extract some sort of refined knowledge from it.
This discussion has been archived.
No new comments can be posted.
The Fine Print: The following comments are owned by whoever posted them. We are not responsible for them in any way.
Quick... (Score:2)
Then again, we could just skip the patent and let WWdN die too. Seems like the internet community would break even.
No need to register! Here's the Text! (Score:1, Informative)
"There is just too much literature to be able to go through it all," said Dr. Liebman, the
I didn't read the article (Score:2, Insightful)
(Yes I am a lazy /. reader)
Re:I didn't read the article (Score:2)
I guess they figured out why so many readers are 90-year-old CEOs of religious organizations in Beverly Hills.
Re:I didn't read the article (Score:2)
Heywood Jablowme@whitehouse.gov
Dick Hertz@yahoo.com
HarryPNisss@microsoft.com
SudoN
ImaNassHole@whitehouse.gov
Hom
and others..Doesn't work?
Re:I didn't read the article (Score:3, Funny)
Re:I didn't read the article (Score:2)
Reminds me of one of the companies I worked at, long ago. SMTP addresses went first initial, last name. However they made an exception for a Samuel Hitt.
Re:I didn't read the article (Score:2)
What's his nospam address? noshitt@aol.com? LOL
Re:I didn't read the article (Score:2)
Re:Alright Slack... (Score:1)
Thanks!
Re:Alright Slack... (Score:2)
I'd like that.
Bringing Star Trek-like Computing one step closer! (Score:1)
create large volumes of junk to feed this.. (Score:2)
text mining is fun until someone creates something to generate a bunch of junk to feed to the text miners..
take a look at my
Re:create large volumes of junk to feed this.. (Score:2)
Otherwise, totally not an issue.
RTFA (Score:2, Funny)
Like those ppl who actually RTFA and try to get "FORST PIST!!!"?
Ouch! (Score:2)
That sounds painful!
why the "Multiverse" buzzword ? (Score:2)
Did the poster even know what it means ?
Re:why the "Multiverse" buzzword ? (Score:2)
One of the groups that I work with [gatech.edu] does some data analysis stuff with how data changes over space (location based) and time (your beliefs yesterday vs. your beliefs today) and the like -- so this could be something along those lines.
Or like you said, it could just be a buzzword!
Support non-whoring reg-free linkage! (Score:5, Informative)
The same article is also posted at CNET, which doesn't require registration. They also have it in a nice single-page [com.com] format for those that don't like to keep hitting "next".
Oh shit (Score:1)
Brute forcing the problem (Score:3, Interesting)
They make it sound like Semantic and Contextual modeling is done on the fly -- the way I see this system, it does this based on a preset lexicon or database.
That's brute-forcing the problem again -- a lot of researchers in the field feel that the real solution does not lie that way. We need to analyze this from the ground up, to gather meaning from data.
The above method fails the moment you have spatial and temporal data -- my lexicon may evolve over a period of time.
You're looking at all the information and then deciding what's for you -- a better way is to develop an "instinct" for the right kind of information and refine it.
If you really want to know where data mining is going, look at KDD or SIGMOD -- that's where all the real action is.
shameluss plug of parent (Score:2)
MOD PARENT UP (and me down i don't care)
finally a
Re:Brute forcing the problem (Score:2)
Is text mi
Re:Brute forcing the problem (Score:2)
Sure, there are those who shortchange some approaches because they have temporal limitations. New data comes in and you need to categorize that too and determine its context or supremacy to data y
Re:Brute forcing the problem (Score:2)
However, one thing that I have learnt (the hard way) over a period of time is that Ontology (Specification of data conceptualization) is infinitely more important than Epistemology (Knowledge of the data).
There is nothing wrong with a system which has tags; the trouble is when you classify it either way -- the references of the tags are once again more important than how they are acquired. You could perhaps have a purely automated system, maybe
Fun with numbers (Score:3, Interesting)
The scary part (that took mathematicians a long time to accept and longer to figure out) is that the distribution is the same for any sufficiently representative sample of text....
Re:Fun with numbers (Score:2)
Benford's law is the name of this phenomenon. It's even more interesting because it is independent of base!
There are many ways that this is used, including detecting human tamp
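For anyone who wants to poke at this, here is a minimal sketch of the usual statement of Benford's law: the leading digit d shows up with probability log_b(1 + 1/d) in base b. The function names and the choice of powers of 2 as sample data are my own assumptions for illustration, not anything from the article, and leading_digit() below only handles base 10.

```python
import math
from collections import Counter

def benford_expected(base=10):
    """Expected leading-digit probabilities under Benford's law for a given base."""
    return {d: math.log(1 + 1 / d, base) for d in range(1, base)}

def leading_digit(n):
    """Leading decimal digit of a nonzero integer."""
    n = abs(n)
    while n >= 10:
        n //= 10
    return n

def leading_digit_freq(values):
    """Observed relative frequency of each leading decimal digit 1..9."""
    digits = [leading_digit(v) for v in values if v != 0]
    counts = Counter(digits)
    total = len(digits)
    return {d: counts.get(d, 0) / total for d in range(1, 10)}

# Illustrative sample (an assumption): the first 1000 powers of 2.
sample = [2 ** k for k in range(1, 1001)]
observed = leading_digit_freq(sample)
expected = benford_expected()

for d in range(1, 10):
    print(d, round(observed[d], 3), round(expected[d], 3))
```

To check a real data set, swap your own list of numbers in for sample and compare the two columns; a big gap is a hint to look closer, not proof of anything.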
Re:Fun with numbers (Score:2)
I find that a large pool of USENET posts works best.
How Long.. (Score:1)
Just head over to tellmewhatthisthingyisabout.com > Print
Hmmm, isn't there a prerequisite??? (Score:2)
Maybe this is just an attempt at getting a machine to generate core knowledge but then haven't they been working on common sense, which is sorta needed first?
Red Necks (Score:2, Funny)
If you have an infinite number of red necks
Could do us a big favor (Score:3, Funny)
Dear Text Miners,
Please start here: http://slashdot.org [slashdot.org]
Thanks so much.
Re:Could do us a big favor (Score:2)
sheesh. Some people!
Well, DUH! (Score:3, Insightful)
May I offer that computers make no sense of what they are reading & that "smart people with knowledge of the particular subject" aren't optional if the results of text-mining are to be of any usefulness whatsoever, at least in any kind of reasonable time frame.
Otherwise, the text-mining computer is playing the old "99 monkeys with typewriters" game...
Re:Well, DUH! (Score:2)
but what about the data itself? (Score:3, Insightful)
all right, you can take huge amounts of text and apply some smart tricks to extract patterns from it.
but how can you determine whether the original data was trustworthy?
take the example of genome annotation (description of gene function), which would be helped greatly by including more functional descriptions from scientific literature. how do you determine whether the original publication was backed by solid experimental research?
by the reviewers of the articles? i don't think so, peer review is a snakepit filled with politics. by the number of people who cited it? hmmmm... so hip subjects are more true?
me personally, because i'm experienced, can recognise bullshit articles when i see them. but how to translate this into an algorithm... anyone any ideas about this? or even working solutions?
(of course this is an example from my field of expertise - biology, but it applies to any set of text data/articles IMO)
Re:but what about the data itself? (Score:2)
For example, an ACM/IEEE source would have a much higher trust metric than say, from some local conference in Egypt (no offence to any local conferences in Egypt, but you get the wind
Re:but what about the data itself? (Score:2)
you rely on peer review, on citation indices, so mostly IM-not-so-HO on matters of politics.
when you scan abstracts yourself, you can dig into the detail when something looks interesting enough, but the decision making process that drives me while scanning abstracts is not much influenced by whether it is in a high-impact journal (or any other high-impact publishing body) or in something mostly not noteworthy.
to put it in another
Re:but what about the data itself? (Score:2)
I would not apply a trust metric to an article based on the journal alone...
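To make the disagreement in this subthread concrete, here is a rough sketch of a trust score that mixes a venue prior with a damped citation count instead of using the journal alone. Everything in it is an assumption for illustration: the venue scores, the default, the 50/50 weighting, and the log damping are all made up, and nobody in this thread proposed these numbers.

```python
import math

# Hypothetical venue priors in [0, 1]; the values are invented for illustration.
VENUE_TRUST = {"ACM": 0.9, "IEEE": 0.9, "local conference": 0.4}
DEFAULT_TRUST = 0.5  # hypothetical prior for venues not listed above

def trust_score(venue, citations, venue_weight=0.5):
    """Blend a venue prior with a citation signal that saturates as counts grow."""
    venue_part = VENUE_TRUST.get(venue, DEFAULT_TRUST)
    # 0 citations gives 0.0; the value rises toward 1.0, but ever more slowly,
    # so one heavily cited "hip" paper cannot dominate the score.
    citation_part = 1 - 1 / (1 + math.log1p(citations))
    return venue_weight * venue_part + (1 - venue_weight) * citation_part

print(trust_score("ACM", 0))
print(trust_score("local conference", 200))
```

Note that this only reweights the same political signals (venue and citations) the posters are skeptical of; it does nothing to check whether the experimental work behind a paper was solid.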
Some notes... (Score:2, Interesting)
Dijkstra once said "The question of whether computers can think is like the question of whether submarines can swim."
(2)As noted in the article sarcasm is very hard to detect. If you think about it even many people have a hard time recognizing it. How are we supposed to develop an intelligent system when we "intelligent" humans don'
Warballs (Score:2)
They just had to get it in somehow
like the 858-page report on the congressional inquiry into intelligence failures regarding the Sept. 11, 2001, terrorist attacks.
unfortunately (Score:2)
KDD Cup (Score:4, Informative)
Re:What's up with Slashdot? (Score:1, Offtopic)
Re:What's up with Slashdot? (Score:1, Offtopic)
Re:What's up with Slashdot? (Score:1, Offtopic)
in addition there is also this tasteless group of guys who keep making posts about greased-up yoda dolls which has also forced me to start browsing at +2..
seems that mod points are being handed out with less frequency than they were before.
I think they should start handing them out for people with "excellent karma" and then track if the metamods agree with the point distribution..
that is just me..
Re:What's up with Slashdot? (Score:1, Offtopic)
Re:What's up with Slashdot? (Score:2)
Now I refresh and see a review of a pirate book with ~70 "+2" comments and "Third Anniversary of Bezos-Backed Patent Reform," which went completely ignored. Meh.
Of course, I'm not helping by posting near-useless comments like this...
Re:What's up with Slashdot? (Score:2)
Re:statistical nlp (Score:2)
(I was trying to type while holding my wife's baby parrot, and he sometimes goes nuclear if you don't pay enough attention to him :-)
BTW, pardon the shameless plug, but I added a short chapter on statistical nlp (simple enough example program to understand easily) to my free Java/AI web book.
-Mark