Science Technology

Text Mining the Multiverse 137

The NYT has a decent piece about text-mining, skimming large volumes of miscellaneous text to extract some sort of refined knowledge from it.
This discussion has been archived. No new comments can be posted.


Comments Filter:
  • Quick, someone patent it before Microsoft does or else Slashdot is going to be the next casualty.

    Then again, we could just skip the patent and let WWdN die too. Seems like the internet community would break even.
  • MICHAEL N. LIEBMAN knows his limitations. Even with a Ph.D. and a long career in medical research, he cannot keep up with all the developments in his area of interest, breast cancer. Medline, the database that already houses more than 10 million abstracts for journal articles, is adding 7,000 to 8,000 abstracts per week. Only a fraction of these are about cancer, but the volume of information is daunting nonetheless.

    "There is just too much literature to be able to go through it all," said Dr. Liebman, the
  • Why does slashdot keep linking to articles that require NYT registration? Isn't there some sort of Google news out there?

    (Yes I am a lazy /. reader)

  • I've always wanted to ask the computer to find all references to some complex interplay of topics at hand the way those Star Fleet engineers were always able to in TNG...
  • yea...

    text mining is fun until someone creates something to generate a bunch of junk to feed to the text miners..

    take a look at my .sig

  • RTFA (Score:2, Funny)

    by devphaeton ( 695736 )
    skimming large volumes of miscellaneous text to extract some sort of refined knowledge from it.

    Like those ppl who actually RTFA and try to get "FORST PIST!!!"?
  • "Multiverse" doesn't appear anywhere in the article. "Multiverse" is a technical term from interpretations of quantum physics, and it is totally misplaced in this news submission.
    Did the poster even know what it means?
  • by Anonymous Coward on Friday October 17, 2003 @03:46PM (#7243816)
    Brought to you by your favorite anonymous non-whoring poster: the Google link [nytimes.com].

    The same article is also posted at CNET, which doesn't require registration. They also have it in a nice single-page [com.com] format for those that don't like to keep hitting "next".

  • If I apply this to slashdot, I'll have only 3-4 posts to read every day... What will I do all day at work?
  • by metlin ( 258108 ) on Friday October 17, 2003 @03:49PM (#7243859) Journal
    To make sense of what it is reading, the software uses algorithms to examine the context behind words.

    They make it sound like Semantic and Contextual modeling is done on the fly -- the way I see this system, it does this based on a preset lexicon or database.

    That's again brute-forcing the problem -- a lot of researchers in the field feel the real solution does not lie that way. We need to analyze this from the ground up to gather meaning from the data.

    The above method fails the moment you have spatial and temporal data -- my lexicon may evolve over a period of time.

    You're looking at all the information and then deciding what's for you -- a better way is to develop an "instinct" for the right kind of information and refine it.

    If you really want to know where data mining is going, look at KDD or SIGMOD -- that's where all the real action is.
    • never did this before

      MOD PARENT UP (and me down i don't care)

      finally a /. comment that makes sense
    • I don't believe that medical journals contain much terminology that changes over time. Most medical words are Latin-based and fairly rigid in their usage. A tibia is a tibia is a tibia. What the doctor has created is a very specific application to skim a very large volume of information and report back things that might be of interest. You are correct that an "instinct" would be much more useful. This does not, however, make the doctor's accomplishment any less viable.

      Is text mi
    • There are more than enough opinions about "the right way" to model data from a semantic or contextual standpoint. Like most things, there's the academic approach and there's the one a company can afford. Whether or not either is appropriate depends on your needs, your point of view, and the size of the corporate wallet.

      Sure, there are those who short-change some approaches because they have temporal limitations. New data comes in and you need to categorize that too and determine its context or supremacy to data y
      • That is true: the answer is neither straightforward nor simple.

        However, one thing that I have learnt (the hard way) over a period of time is that Ontology (Specification of data conceptualization) is infinitely more important than Epistemology (Knowledge of the data).

        There is nothing wrong with a system which has tags; the trouble is when you classify it either way -- the references of the tags are once again more important than how they are acquired. You could perhaps have a purely automated system, maybe
  • Fun with numbers (Score:3, Interesting)

    by ajs ( 35943 ) <ajs@ajs . c om> on Friday October 17, 2003 @03:50PM (#7243869) Homepage Journal
    Here's some fun you can have with numbers. Take this Perl one-liner:
    perl -ne '$x{$1}++ while /(\d)/g;END{print map {"$_ occured $x{$_} times\n"} sort {$a<=>$b} keys %x}' xxxx
    and run it with "xxxx" replaced by the name of some large text file that you create by saving email messages, web pages, log files, what have you.

    The scary part (that took mathematicians a long time to accept and longer to figure out) is that the distribution is the same for any sufficiently representative sample of text....
    • Ooops. While you see a distribution there, that's not what I was trying to point out. The correct one-liner would be

      perl -ne '$x{$1}++ while /\b([1-9])\d*\b/g;END{$all+=$_ foreach map {$x{$_}} keys %x;print map {sprintf "%d occured %.1f%% (%d times)\n",$_,$x{$_}/$all*100,$x{$_}} sort {$a<=>$b} keys %x;print "$. lines read\n"}'

      Benford's law is the name of this phenomenon. It's even more interesting because it is independent of base!

      There are many ways that this is used, including detecting human tamp

    • The key is sufficiently representative. And I'm not quite clear on what that means, but I know some examples:

      Make a list of the areas of all the lakes in your state. It doesn't matter what the units are. The distribution will be such that the highest count will be the ones, and the lowest count will be the nines.

      Same for a list of all the house numbers in a city. Same for a list of just about anything you can think of, in whatever units you want.

      This can be used to detect fraud. For example, if you look at the finaci
    • For those curious of what the output gives, here's the output as run on the Slashdot homepage stripped of HTML etc:

      1 occured 44.9% (57 times)
      2 occured 17.3% (22 times)
      3 occured 9.4% (12 times)
      4 occured 7.1% (9 times)
      5 occured 11.0% (14 times)
      6 occured 1.6% (2 times)
      7 occured 3.9% (5 times)
      8 occured 2.4% (3 times)
      9 occured 2.4% (3 times)
      315 lines read

      Run the same thing on, say, Microsoft's home page and you get:

      1 occured 13.6% (3 times)
      2 occured 27.3% (6 times)
      3 occured 4.5% (1 times)
      4 occured 18.2% (
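The leading-digit pattern demonstrated above is Benford's law. A quick way to see it outside of web pages is the powers of 2, a classic Benford-distributed set -- here as a Python sketch rather than Perl, purely for illustration:

```python
import math
from collections import Counter

# Benford's law predicts P(d) = log10(1 + 1/d) for leading digit d.
# The leading digits of 2**n are a classic Benford-distributed sequence.
counts = Counter(int(str(2 ** n)[0]) for n in range(1, 1001))
total = sum(counts.values())
for d in range(1, 10):
    observed = counts[d] / total
    predicted = math.log10(1 + 1 / d)
    print(f"{d} occurred {observed:.1%} (Benford predicts {predicted:.1%})")
```

Digit 1 leads roughly 30% of the time, digit 9 under 5% -- and, as noted above, the same law holds in any base.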

  • ..until no student ever has to research any topic again?

    Just head over to tellmewhatthisthingyisabout.com > Print
  • I took a speed-reading course and read War and Peace in twenty minutes. It involves Russia. -- Woody Allen
  • That the text has to first contain some knowledge in it to begin with?

    Maybe this is just an attempt at getting a machine to generate core knowledge but then haven't they been working on common sense, which is sorta needed first?
  • by Anonymous Coward
    Tired of going through their stupid registration? CLICK HERE [nytimes.com]
  • Red Necks (Score:2, Funny)

    by k_stamour ( 544142 )
    "to extract some sort of refined knowledge from it." hum....
    If you have an infinite number of red necks.... an infinite number of shotguns & shotgun shells.... And an infinite number of stop signs, you will eventually get Shakespeare in braille.....
  • skimming large volumes of miscellaneous text to extract some sort of refined knowledge from it.

    like grep?

    I'm sorry, reading this text requires meta-technology.

  • by Strange Ranger ( 454494 ) on Friday October 17, 2003 @03:57PM (#7243964)
    ...skimming large volumes of miscellaneous text to extract some sort of refined knowledge from it.

    Dear Text Miners,

    Please start here: http://slashdot.org [slashdot.org]

    Thanks so much.
  • Text-mining programs go further, categorizing information, making links between otherwise unconnected documents

    For any google results
    "Category" is shown right on top of the results.
    "Links" - try link:slashdot.org & related:slashdot.org as google queries.

    If someone is doing research on computer modeling, for example, it not only knows to discard documents about fashion models but can also extract important phrases, terms, names and locations

    Try the google advanced search you can search with "all of
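The "computer modeling vs. fashion models" trick the article describes can be approximated crudely by counting co-occurring cue words. A toy sketch -- the cue lists and domain labels here are invented for illustration, not how any real product does it:

```python
# Disambiguate an ambiguous term by counting nearby domain cue words.
# Cue words and domain names are hypothetical examples.
CUES = {
    "computing": {"simulation", "software", "data", "algorithm"},
    "fashion": {"runway", "magazine", "photo", "designer"},
}

def classify(doc: str) -> str:
    """Return the domain whose cue words overlap the document most."""
    words = set(doc.lower().split())
    return max(CUES, key=lambda domain: len(CUES[domain] & words))

print(classify("a software model of climate data simulation"))  # computing
print(classify("runway photo shoot with a famous model"))       # fashion
```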
  • Hey anyone else think this [nytimes.com] picture was really cool?
  • Well, DUH! (Score:3, Insightful)

    by djeaux ( 620938 ) on Friday October 17, 2003 @04:13PM (#7244119) Homepage Journal
    How well computers truly make sense of what they are reading is, of course, highly questionable, and most of those who use text-mining software say that it works best when guided by smart people with knowledge of the particular subject.

    May I offer that computers make no sense of what they are reading & that "smart people with knowledge of the particular subject" aren't optional if the results of text-mining are to be of any usefulness whatsoever, at least in any kind of reasonable time frame.

    Otherwise, the text-mining computer is playing the old "99 monkeys with typewriters" game...

    • i would say that is a problem of the underlying data, not of the textmining per se.
    • It's not as bad as you think. Check out Vivisimo [vivisimo.com]
    • The compelling dream is that you laboriously load up a computer with enough facts so that it can glean understanding of what it's reading, and one glorious day the computer has enough smarts to make sense of things on its own, and two weeks after crawling the entire Internet, it knows everything.

      Hence Doug Lenat's Cyc [cyc.com], now partly open source [opencyc.org]. Unfortunately that glorious day has been "a few years away" for over 13 years.

      The knowledge base is built upon a core of over 1,000,000 hand-entered assertions

      • But I haven't come across any postings from Cyc on Slashdot correcting misinformation and lies.

        How do you know? You cannot verify human authorship just by looking at the text. Perhaps the Goatse troll is really an AI bot.
  • Do you have to wear a mining hat :p
    -Seriv
  • Mining for data that might be related based on proximity, either temporal or locational, starts to get interesting when you are dealing with millions of interactions, as in a call center's voice data (check out www.callminer.com). Suddenly finding out that when a customer says "hurricane" in an insurance call center, your agents are 5x more likely to hand them off to a supervisor is real money-saving information. This is what this technology is good for, and it is being bought and used by a lot of compa
  • by koekepeer ( 197127 ) on Friday October 17, 2003 @04:35PM (#7244331)
    i always wondered about this

    all right, you can take huge amounts of text and apply some smart tricks to extract patterns from it.

    but how can you determine whether the original data was trustworthy?

    take the example of genome annotation (description of gene function), which would be helped greatly by including more functional descriptions from scientific literature. how do you determine whether the original publication was backed by solid experimental research?

    by the reviewers of the articles? i don't think so, peer review is a snakepit filled with politics. by the amount of people who cited it? hmmmm... so hip subjects are more true?

    personally, because i'm experienced, i can recognise bullshit articles when i see them. but how to translate this into an algorithm... anyone have any ideas about this? or even working solutions?

    (of course this is an example from my field of expertise - biology, but it applies to any set of text data/articles IMO)
    • Hmm, I guess you cannot say that for sure, but most systems today use trust metrics.

      For example, an ACM/IEEE source would have a much higher trust metric than, say, some local conference in Egypt (no offence to any local conferences in Egypt, but you get the drift :)
      • i see the point, but is this truly representative of reliability?

        you rely on peer review, on citation indices, so mostly IM-not-so-HO on matters of politics.

        when you scan abstracts yourself, you can dig into the detail when something looks interesting enough, but the decision making process that drives me while scanning abstracts is not much influenced by the fact whether it is in a high impact journal (or any other high impact publishing body) or in something mostly not noteworthy.

        to put it in another
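The trust-metric idea in this subthread could be sketched as a simple weighted ranking -- the sources, weights, and scores below are entirely hypothetical:

```python
# Rank documents by relevance scaled by a per-source trust weight.
# Source names and weights are invented for illustration.
SOURCE_TRUST = {"ieee": 0.9, "acm": 0.9, "preprint": 0.5, "unknown": 0.2}

def rank(docs):
    """docs: list of (title, source, relevance in [0, 1]); best first."""
    return sorted(docs, key=lambda d: SOURCE_TRUST.get(d[1], 0.2) * d[2],
                  reverse=True)

docs = [("A", "unknown", 0.95), ("B", "ieee", 0.6), ("C", "acm", 0.4)]
print(rank(docs))  # B outranks A despite A's higher raw relevance
```

As the reply points out, this just pushes the problem into choosing the weights -- which is itself largely a political judgement.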
  • Some notes... (Score:2, Interesting)

    by ekephart ( 256467 )
    (1)"Of course, no one, Dr. Liebman included, is arguing that these products are actually reading anything. What they are engaged in is "text mining,'' "

    Dijkstra once said "The question of whether computers can think is like the question of whether submarines can swim." ... Just a thought.

    (2) As noted in the article, sarcasm is very hard to detect. If you think about it, even many people have a hard time recognizing it. How are we supposed to develop an intelligent system when we "intelligent" humans don'
  • You know how those guys at MIT are constantly trying to figure out ways to teach their robots how to interact with people? Let the robots roam the Internet with a topic in mind. If I'm at a party with a bunch of dog groomers I'm probably not going to say much. I'm sure robots have the same issue; they have nothing in common with us. If we start by making a Cancer-Expert-Bot then let it try to have a conversation with an oncologist I think AI will have more success.

  • They just had to get it in somehow :

    like the 858-page report on the congressional inquiry into intelligence failures regarding the Sept. 11, 2001, terrorist attacks.
  • I spent more years than I care to admit writing natural language processing software that tried to extract semantic information - conceptual dependency, parsers, etc.

    I gave up a few years ago, now I mostly use statiscal approaches (markov processes, word counts, huge databases of proper names, etc.)

    -Mark

    • .. I meant "statistical approaches", not "statiscal approaches" ..

      (I was trying to type while holding my wife's baby parrot, and he sometimes goes nuclear if you don't pay enough attention to him :-)

      BTW, pardon the shameless plug, but I added a short chapter on statistical nlp (simple enough example program to understand easily) to my free Java/AI web book.

      -Mark
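The "statistical approaches (markov processes, word counts...)" mentioned above can be illustrated with a first-order word-level Markov model -- a toy sketch, not the poster's actual software:

```python
import random
from collections import defaultdict

def build_markov(text):
    """Map each word to the list of words observed to follow it."""
    words = text.split()
    model = defaultdict(list)
    for a, b in zip(words, words[1:]):
        model[a].append(b)
    return model

def generate(model, start, n=10, seed=0):
    """Random-walk the model to emit up to n words after `start`."""
    rng = random.Random(seed)
    out = [start]
    for _ in range(n):
        successors = model.get(out[-1])
        if not successors:
            break
        out.append(rng.choice(successors))
    return " ".join(out)

corpus = "the cat sat on the mat and the dog sat on the rug"
model = build_markov(corpus)
print(generate(model, "the"))
```

Every adjacent word pair in the output is a bigram seen in the corpus -- which is exactly why such models produce locally plausible but globally meaningless text.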

  • NYT won't be contributing to this large body of text, because registration is STILL required.
  • skimming large volumes of miscellaneous text to extract some sort of refined knowledge from it.

    I like to call it High School.
  • but could a person who had sufficient knowledge of the program(s) build a large document (say, the giant 9/11 intelligence-failure doc mentioned in the article) so as to fool the text-miners? Subtle misinformation -- let's say that widespread use of text miners results in larger docs being published, then unscrupulous types bury information in such a way that a ridiculously long human endeavour will turn them up, but the programs won't, so those responsible can say: "See? It's all in there. Your program is
  • MS Word has a surprising "summary" feature that has given me impressive results in Portuguese. How the hell do they do that?
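Word's summarizer is proprietary, but the usual guess is frequency-based extractive summarization: score each sentence by how common its words are in the whole document and keep the top few. A naive sketch (speculation, not Microsoft's actual algorithm):

```python
import re
from collections import Counter

def summarize(text, n=2):
    """Keep the n sentences whose words are most frequent in the
    document overall, preserving their original order."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    freq = Counter(re.findall(r"\w+", text.lower()))
    score = lambda s: sum(freq[w] for w in re.findall(r"\w+", s.lower()))
    keep = set(sorted(sentences, key=score, reverse=True)[:n])
    return " ".join(s for s in sentences if s in keep)

doc = ("Text mining extracts knowledge from text. Cats are nice. "
       "Text mining uses text statistics.")
print(summarize(doc, 2))  # drops the off-topic "Cats" sentence
```

This mostly works because it is language-agnostic: nothing here knows Portuguese, only word frequencies.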
  • KDD Cup (Score:4, Informative)

    by apsmith ( 17989 ) * on Friday October 17, 2003 @11:18PM (#7246610) Homepage
    The knowledge discovery and datamining cup challenge this year [cornell.edu] was looking at the arxiv.org [arxiv.org] papers for this sort of analysis - some very interesting results. The Task 4 winner [umass.edu] looked at the structure of the papers as a sort of relational database and uncovered a lot of statistical patterns and metrics that could be quite useful for scientists.
  • "I was an FBI agent for 20 years," said Randall Murch, now a researcher at the Institute for Defense Analyses, which works for the Office of the Defense Secretary and other government agencies. "And I have yet to see anyone who is able to model the way an agent thinks and works through an investigation."

    Apart from suggesting the jibe that, of course, only an ex-FBI dick could think that anyone would want to model his/her behaviour, this misses the point that text-mining is intended to find precisely thos
  • think before they use a hammer. Using software to fix problems that exist within their human intelligence arena is soooo typical. The bit about subrogation is so idiotic, I can't believe it. Any idiot can check a box on the report if there is a basis for subrogation. If there is enough data in the report to determine a basis for subro. then the adjuster obviously knew that it should have been handed to the subro. dept. from the outset. There is obviously an issue here. The adjusters are reluctant to send c
