Text Mining the Multiverse 137
The NYT has a decent piece about text-mining, skimming large volumes of miscellaneous text to extract some sort of refined knowledge from it.
Only great masters of style can succeed in being obtuse. -- Oscar Wilde Most UNIX programmers are great masters of style. -- The Unnamed Usenetter
Quick... (Score:2)
Then again, we could just skip the patent and let WWdN die too. Seems like the internet community would break even.
Re:Quick... (Score:1)
Re:Quick... (Score:1)
Prior Art (Score:1)
No need to register! Here's the Text! (Score:1, Informative)
"There is just too much literature to be able to go through it all," said Dr. Liebman, the
Copyright (Score:1)
Re:Copyright (Score:1)
I didn't read the article (Score:2, Insightful)
(Yes I am a lazy /. reader)
Re:I didn't read the article (Score:2)
I guess they figured out why so many readers are 90-year-old CEOs of religious organizations in Beverly Hills.
Re:I didn't read the article (Score:2)
Heywood Jablowme@whitehouse.gov
Dick Hertz@yahoo.com
HarryPNisss@microsoft.com
SudoN
ImaNassHole@whitehouse.gov
Hom
and others..Doesn't work?
Re:I didn't read the article (Score:3, Funny)
Re:I didn't read the article (Score:1)
Re:I didn't read the article (Score:2)
Reminds me of one of the companies I worked at, long ago. SMTP addresses went first initial, last name. However they made an exception for a Samuel Hitt.
Re:I didn't read the article (Score:2)
What's his nospam address? noshitt@aol.com? LOL
Re:I didn't read the article (Score:1)
Re:I didn't read the article (Score:2)
perhaps more importantly (Score:1)
The fact is that the New Yor
Re:bullshit (Score:1)
uh, no (Score:1)
Perhaps Slashdot should get in touch with the NYT and see if they can get a partnership set up, but stealing someone else's wouldn't be such a hot idea.
Re:Alright Slack... (Score:1)
Thanks!
Re:Alright Slack... (Score:2)
I'd like that.
Re:Alright Slack... (Score:1)
Oh you mean (Score:1)
Bringing Star Trek-like Computing one step closer! (Score:1)
create large volumes of junk to feed this.. (Score:2)
text mining is fun until someone creates something to generate a bunch of junk to feed to the text miners..
take a look at my
Re:create large volumes of junk to feed this.. (Score:2)
Otherwise, totally not an issue.
Re:create large volumes of junk to feed this.. (Score:1)
Too late. It's called Slashcode.
Ba-dum bum. =P
RTFA (Score:2, Funny)
Like those ppl who actually RTFA and try to get "FORST PIST!!!"?
Ouch! (Score:2)
That sounds painful!
why the "Multiverse" buzzword ? (Score:2)
Did the poster even know what it means ?
Re:why the "Multiverse" buzzword ? (Score:1)
Re:why the "Multiverse" buzzword ? (Score:1)
Re:why the "Multiverse" buzzword ? (Score:2)
One of the groups that I work with [gatech.edu] does some data analysis stuff with how data changes over space (location based) and time (your beliefs yesterday vs. your beliefs today) and the like -- so this could be something along those lines.
Or like you said, it could just be a buzzword!
Re:why the "Multiverse" buzzword ? (Score:1)
-Seriv
obviously its (Score:1)
8-PP
Support non-whoring reg-free linkage! (Score:5, Informative)
The same article is also posted at CNET, which doesn't require registration. They also have it in a nice single-page [com.com] format for those that don't like to keep hitting "next".
Oh shit (Score:1)
Brute forcing the problem (Score:3, Interesting)
They make it sound like Semantic and Contextual modeling is done on the fly -- the way I see this system, it does this based on a preset lexicon or database.
That's again brute forcing the problem -- a lot of researchers in the field feel that the real solution does not lie that way. We need to analyze this from the ground up, to gather meaning from data.
The above method fails the moment you have spatial and temporal data -- my lexicon may evolve over a period of time.
You're looking at all the information and then deciding what's for you -- a better way is to develop an "instinct" for the right kind of information and refine it.
If you really want to know where data mining is going, look at KDD or SIGMOD -- that's where all the real action is.
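A minimal sketch of the preset-lexicon approach described above, to make the objection concrete. The lexicon entries and the tag_document helper are hypothetical illustrations, not anything taken from the article's system. Words the lexicon has never seen simply fall through, which is the weakness when vocabulary drifts over time.

```python
from collections import Counter

# Hypothetical fixed lexicon: word -> topic. A real system would be far larger.
LEXICON = {
    "gene": "biology",
    "protein": "biology",
    "kernel": "computing",
    "compiler": "computing",
}

def tag_document(text, lexicon=LEXICON):
    """Count topic hits from a fixed lexicon and collect the words it misses."""
    words = [w.strip(".,;:!?").lower() for w in text.split()]
    words = [w for w in words if w]
    hits = Counter(lexicon[w] for w in words if w in lexicon)
    unknown = [w for w in words if w not in lexicon]
    return hits, unknown

hits, unknown = tag_document("The kernel compiler handles blockchain workloads.")
print(hits)     # topics found through the preset lexicon
print(unknown)  # everything the lexicon cannot classify, including new terms
```

Anything that lands in the unknown list is invisible to the classifier until someone updates the lexicon by hand.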
shameluss plug of parent (Score:2)
MOD PARENT UP (and me down i don't care)
finally a
Re:Brute forcing the problem (Score:2)
Is text mi
Re:Brute forcing the problem (Score:2)
Sure there are those who shortchange some approaches because they have temporal limitations. New data comes in and you need to categorize that too and determine its context or supremacy to data y
Re:Brute forcing the problem (Score:2)
However, one thing that I have learnt (the hard way) over a period of time is that Ontology (Specification of data conceptualization) is infinitely more important than Epistemology (Knowledge of the data).
There is nothing wrong with a system which has tags; the trouble is when you classify it either way -- the references of the tags are once again more important than how they are acquired. You could perhaps have a purely automated system, maybe
Fun with numbers (Score:3, Interesting)
The scary part (that took mathematicians a long time to accept and longer to figure out) is that the distribution is the same for any sufficiently representative sample of text....
Re:Fun with numbers (Score:2)
Benford's law is the name of this phenomenon. It's even more interesting because it is independent of base!
There are many ways that this is used, including detecting human tamp
Re:Fun with numbers (Score:1)
Re:Fun with numbers (Score:1)
P.S. The filters suck. *sigh*
Re:Fun with numbers (Score:1)
Make a list of the areas of all the lakes in your state. Doesn't matter what the units are. Look at the leading digit of each number: the distribution will be such that the highest count will be the ones, and the lowest count will be the nines.
Same for a list of all the house numbers in a city. Same for a list of just about anything you can think of, in whatever units you want.
This can be used to detect fraud. For example, if you look at the finaci
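A rough sketch of how one might check a list of numbers against Benford's law, where the expected share of leading digit d is log10(1 + 1/d). The helper names are hypothetical, and the sample data (powers of 2) is only a stand-in for lake areas or house numbers, not real measurements.

```python
import math
from collections import Counter

def leading_digit(x):
    """Return the first non-zero digit of a non-zero number."""
    # Scientific notation puts the leading digit first, e.g. 0.042 -> '4.2...e-02'.
    return int(f"{abs(x):.15e}"[0])

def benford_expected(d):
    """Share of values Benford's law predicts to start with digit d (1-9)."""
    return math.log10(1 + 1 / d)

def leading_digit_shares(values):
    """Observed share of each leading digit 1-9 among the non-zero values."""
    digits = [leading_digit(v) for v in values if v != 0]
    counts = Counter(digits)
    return {d: counts.get(d, 0) / len(digits) for d in range(1, 10)}

# Stand-in data set: powers of 2, which are known to follow the law closely.
sample = [2 ** n for n in range(1, 201)]
observed = leading_digit_shares(sample)
for d in range(1, 10):
    print(d, round(observed[d], 3), round(benford_expected(d), 3))
```

A data set whose observed shares sit far from the expected column is a candidate for a closer look, though a mismatch alone proves nothing.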
Re:Fun with numbers (Score:1)
Run the same thing on, say, Microsoft's home page and you get:
Re:Fun with numbers (Score:2)
I find that a large pool of USENET posts works best.
How Long.. (Score:1)
Just head over to tellmewhatthisthingyisabout.com > Print
Speed reading (Score:1)
Re:Speed reading (Score:1)
Hmmm, isn't there a prerequisite??? (Score:2)
Maybe this is just an attempt at getting a machine to generate core knowledge but then haven't they been working on common sense, which is sorta needed first?
Fark Registration. Get in without stupid reg. (Score:1, Informative)
Red Necks (Score:2, Funny)
If you have an infinite number of red necks
You mean... (Score:1)
like grep?
I'm sorry, reading this text requires meta-technology.
Could do us a big favor (Score:3, Funny)
Dear Text Miners,
Please start here: http://slashdot.org [slashdot.org]
Thanks so much.
Re:Could do us a big favor (Score:2)
sheesh. Some people!
Re:Could do us a big favor (Score:1)
Should it filter Funny?
sorry, still sounds a lot like text searching (Score:1)
For any google results
"Category" is shown right on top of the results.
"Links" - try link:slashdot.org & related:slashdot.org as google queries.
If someone is doing research on computer modeling, for example, it not only knows to discard documents about fashion models but can also extract important phrases, terms, names and locations
Try the google advanced search you can search with "all of
Cool Picture (Score:1)
You mean you don't already have one? (Score:1, Offtopic)
Well, DUH! (Score:3, Insightful)
May I offer that computers make no sense of what they are reading & that "smart people with knowledge of the particular subject" aren't optional if the results of text-mining are to be of any usefulness whatsoever, at least in any kind of reasonable time frame.
Otherwise, the text-mining computer is playing the old "99 monkeys with typewriters" game...
Re:Well, DUH! (Score:2)
Re:Well, DUH! (Score:1)
Hence the CYC project (Score:1)
The compelling dream is that you laboriously load up a computer with enough facts so that it can glean understanding of what it's reading, and one glorious day the computer has enough smarts to make sense of things on its own, and two weeks after crawling the entire Internet, it knows everything.
Hence Doug Lenat's Cyc [cyc.com], now partly open source [opencyc.org]. Unfortunately that glorious day has been "a few years away" for over 13 years.
Re:Hence the CYC project (Score:1)
How do you know? You cannot verify human authorship just by looking at the text. Perhaps the Goatse troll is really an AI bot.
while text mining... (Score:1)
-Seriv
Re:while text mining... (Score:1)
-Seriv
Text Mining for Correlation (Score:1)
but what about the data itself? (Score:3, Insightful)
all right, you can take huge amounts of text and apply some smart tricks to extract patterns from it.
but how can you determine whether the original data was trustworthy?
take the example of genome annotation (description of gene function), which would be helped greatly by including more functional descriptions from scientific literature. how do you determine whether the original publication was backed by solid experimental research?
by the reviewers of the articles? i don't think so, peer review is a snakepit filled with politics. by the amount of people who cited it? hmmmm... so hip subjects are more true?
me personally, because i'm experienced, can recognise bullshit articles when i see them. but how to translate this into an algorithm... anyone any ideas about this? or even working solutions?
(of course this is an example from my field of expertise - biology, but it applies to any set of text data/articles IMO)
Re:but what about the data itself? (Score:2)
For example, an ACM/IEEE source would have a much higher trust metric than say, from some local conference in Egypt (no offence to any local conferences in Egypt, but you get the wind
Re:but what about the data itself? (Score:2)
you rely on peer review, on citation indices, so mostly IM-not-so-HO on matters of politics.
when you scan abstracts yourself, you can dig into the detail when something looks interesting enough, but the decision making process that drives me while scanning abstracts is not much influenced by the fact whether it is in a high impact journal (or any other high impact publishing body) or in something mostly not noteworthy.
to put it in another
Re:but what about the data itself? (Score:2)
I would not apply a trust metric to an article based on the journal alone...
Some notes... (Score:2, Interesting)
Dijkstra once said "The question of whether computers can think is like the question of whether submarines can swim."
(2) As noted in the article, sarcasm is very hard to detect. If you think about it, even many people have a hard time recognizing it. How are we supposed to develop an intelligent system when we "intelligent" humans don'
Cognitive Science? (Score:1)
Warballs (Score:2)
They just had to get it in somehow
like the 858-page report on the congressional inquiry into intelligence failures regarding the Sept. 11, 2001, terrorist attacks.
statistical nlp (Score:1)
I gave up a few years ago, now I mostly use statistical approaches (Markov processes, word counts, huge databases of proper names, etc.)
-Mark
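For readers who have not seen the statistical approach, here is a minimal sketch of two of the techniques named above: word counts and a first-order Markov (bigram) model. The function names and the toy corpus are made up for illustration and are not taken from the poster's own code.

```python
from collections import Counter, defaultdict

def tokenize(text):
    """Lowercase and split on whitespace; real tokenizers do much more."""
    return text.lower().split()

def word_counts(text):
    """Plain word frequency table."""
    return Counter(tokenize(text))

def build_bigram_model(text):
    """Map each word to a Counter of the words that follow it."""
    model = defaultdict(Counter)
    words = tokenize(text)
    for current, following in zip(words, words[1:]):
        model[current][following] += 1
    return model

def most_likely_next(model, word):
    """Most frequent follower of a word, or None if the word was never followed."""
    followers = model.get(word)
    if not followers:
        return None
    return followers.most_common(1)[0][0]

corpus = "text mining is fun and text mining is hard and text search is old"
counts = word_counts(corpus)
model = build_bigram_model(corpus)
print(counts.most_common(3))
print(most_likely_next(model, "text"))
```

No grammar or meaning is involved; the model only records which words tend to follow which, and that is often enough for tagging or crude prediction.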
Re:statistical nlp (Score:2)
(I was trying to type while holding my wife's baby parrot, and he sometimes goes nuclear if you don't pay enough attention to him :-)
BTW, pardon the shameless plug, but I added a short chapter on statistical nlp (simple enough example program to understand easily) to my free Java/AI web book.
-Mark
unfortunately (Score:2)
Skimming random information? (Score:1)
I like to call it High School.
That's what students are for. (Score:1)
just curious (Score:1)
Text-miner in MS-Word (Score:1)
KDD Cup (Score:4, Informative)
FBI agents? (Score:1)
Apart from suggesting the jibe that, of course, only an ex-fbi dick could think that anyone would want to model his/her behaviour, this misses the point that text-mining is intended to find precisely thos
Subrogation - Firemen's Fund would do well to (Score:1)
Re:What's up with Slashdot? (Score:1, Offtopic)
Re:What's up with Slashdot? (Score:1)
That explains some of the problems, but not everything. For instance, why haven't I had mod points in nearly two years, despite having good karma and contributing to conversations (rather than trolling)? Yes, I know the rules about getting moderation points, but even with those I'd expect to get points at least two or three times a year, not every two years. As well, I recently started noticing that Slashdot has popup ads now (I saw one in the last day, and then added slashdot into my popup blocker's bla
Re:What's up with Slashdot? (Score:1, Offtopic)
Re:What's up with Slashdot? (Score:1, Offtopic)
in addition there is also this tasteless group of guys who keep making posts about greased-up yoda dolls which has also forced me to start browsing at +2..
seems that mod points are being handed out with less frequency than they were before.
I think they should start handing them out for people with "excellent karma" and then track if the metamods agree with the point distribution..
that is just me..
Re:What's up with Slashdot? (Score:1, Offtopic)
Re:What's up with Slashdot? (Score:1, Offtopic)
Now, as far as the reason for fewer posts, I know that the editors have said that late summer-fall tends to be slower for news, but I also think they've been putting up some boring articles latel
Re:What's up with Slashdot? (Score:2)
Now I refresh and see a review of a pirate book with ~70 "+2" comments and "Third Anniversary of Bezos-Backed Patent Reform," which went completely ignored. Meh.
Of course, I'm not helping by posting near-useless comments like this...
Re:What's up with Slashdot? (Score:2)
Re:What's up with Slashdot? (Score:1)