Text Mining the Multiverse

Posted by michael on Fri Oct 17, 2003 03:41 PM
from the mother-lode dept.
The NYT has a decent piece about text-mining, skimming large volumes of miscellaneous text to extract some sort of refined knowledge from it.
This discussion has been archived. No new comments can be posted.
The Fine Print: The following comments are owned by whoever posted them. We are not responsible for them in any way.
  • Quick, someone patent it before Microsoft does or else Slashdot is going to be the next casualty.

    Then again, we could just skip the patent and let WWdN die too. Seems like the internet community would break even.
  • MICHAEL N. LIEBMAN knows his limitations. Even with a Ph.D. and a long career in medical research, he cannot keep up with all the developments in his area of interest, breast cancer. Medline, the database that already houses more than 10 million abstracts for journal articles, is adding 7,000 to 8,000 abstracts per week. Only a fraction of these are about cancer, but the volume of information is daunting nonetheless.

    "There is just too much literature to be able to go through it all," said Dr. Liebman, the
  • Why does slashdot keep linking to articles that require NYT registration? Isn't there some sort of Google news out there?

    (Yes I am a lazy /. reader)

  • I've always wanted to ask the computer to find all references to some complex interplay of topics at hand the way those Star Fleet engineers were always able to in TNG...
  • yea...

    text mining is fun until someone creates something to generate a bunch of junk to feed to the text miners..

    take a look at my .sig

  • skimming large volumes of miscellaneous text to extract some sort of refined knowledge from it.

    Like those ppl who actually RTFA and try to get "FORST PIST!!!"?
  • Multiverse doesn't appear anywhere in the article. Multiverse is a technical term used in interpreting quantum physics. It is totally misplaced in this news submission.
    Did the poster even know what it means?
    • I'm guessing that this is for data in multiple versions of documents -- spatially and temporally disparate ones.

      One of the groups that I work with [gatech.edu] does some data analysis stuff with how data changes over space (location based) and time (your beliefs yesterday vs. your beliefs today) and the like -- so this could be something along those lines.

      Or like you said, it could just be a buzzword! :)
  • by Anonymous Coward on Friday October 17 2003, @03:46PM (#7243816)
    Brought to you by your favorite anonymous non-whoring poster: the Google link [nytimes.com].

    The same article is also posted at CNET, which doesn't require registration. They also have it in a nice single-page [com.com] format for those that don't like to keep hitting "next".

  • If I apply this to slashdot, I'll have only 3-4 posts to read everyday... What will I do all day at work?
  • To make sense of what it is reading, the software uses algorithms to examine the context behind words.

    They make it sound like Semantic and Contextual modeling is done on the fly -- the way I see this system, it does this based on a preset lexicon or database.

    That's again brute forcing the problem -- a lot of researchers in the field feel that the real solution does not lie that way. We need to analyze this from the ground up, to gather meaning from data.

    The above method fails the moment you have spatial and temporal data -- my lexicon may evolve over a period of time.

    You're looking at all the information and then deciding what's for you -- a better way is to develop an "instinct" for the right kind of information and refine it.

    If you really want to know where data mining is going, look at KDD or SIGMOD -- that's where all the real action is.
    • never did this before

      MOD PARENT UP (and me down i don't care)

      finally a /. comment that makes sense
    • I don't believe that medical terminology & medical journals contain a lot of terminology that changes over time. Most medical words are Latin-based and fairly rigid in their usage. A tibia is a tibia is a tibia. What the doctor has created is a very specific application to skim a very large volume of information and report back things that might be of interest. You are correct that an "instinct" would be much more useful. This does not however make the doctor's accomplishment any less viable.

      Is text mi
    • There are more than enough opinions about "the right way" to model data from a semantic or contextual standpoint. Like most things there's the academic approach and there's one that a company can afford. Whether or not either is appropriate depends on your needs, point of view and the size of the corporate wallet.

      Sure there are those who short change some approaches because they have temporal limitations. New data comes in and you need to categorize that too and determine its context or supremacy to data y
      • That is true, that the answer is not straight nor is it simple.

        However, one thing that I have learnt (the hard way) over a period of time is that Ontology (Specification of data conceptualization) is infinitely more important than Epistemology (Knowledge of the data).

        There is nothing wrong with a system which has tags, the trouble is when you classify it either way -- the references of the tags are once again more important than how they are acquired. You could perhaps have a purely automated system, maybe
  • Fun with numbers (Score:3, Interesting)

    by ajs (35943) <ajs&ajs,com> on Friday October 17 2003, @03:50PM (#7243869) Homepage Journal
    Here's some fun you can have with numbers. Take this Perl one-liner:
    perl -ne '$x{$1}++ while /(\d)/g;END{print map {"$_ occurred $x{$_} times\n"} sort {$a<=>$b} keys %x}' xxxx
    and run it with "xxxx" replaced by the name of some large text file that you create by saving email messages, web pages, log files, what have you.

    The scary part (that took mathematicians a long time to accept and longer to figure out) is that the distribution is the same for any sufficiently representative sample of text....
    • Ooops. While you see a distribution there, that's not what I was trying to point out. The correct one-liner would be

      perl -ne '$x{$1}++ while /\b([1-9])\d*\b/g;END{$all+=$_ foreach map {$x{$_}} keys %x;print map {sprintf "%d occurred %.1f%% (%d times)\n",$_,$x{$_}/$all*100,$x{$_}} sort {$a<=>$b} keys %x;print "$. lines read\n"}'

      Benford's law is the name of this phenomenon. It's even more interesting because it is independent of base!

      There are many ways that this is used, including detecting human tamp

      • Yep, binaries are a good example. Basically, in any data files that represent large systems with many variables, you should find that the Perl regular expression
        /\b(\d)\d*\b/g
        should match a 1 most often. In some types of text (especially code), you will find things like "0" show up a lot. That's why my example only accepts a leading digit of 1-9, but if you want to allow 0 as well, that's cool.

        I find that a large pool of USENET posts works best.
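The corrected one-liner in this thread can also be sketched in Python (a choice made only for this sketch; the thread itself uses Perl). It counts leading digits with the same pattern, `\b([1-9])\d*\b`, and sets each observed share next to the share Benford's law predicts, log10(1 + 1/d). The function names are hypothetical, and the comparison step is an addition that the original one-liner does not perform.

```python
import math
import re
from collections import Counter

# Same pattern as the Perl one-liner: capture the leading digit (1-9)
# of every whole-number token; tokens starting with 0 are skipped.
LEADING_DIGIT = re.compile(r"\b([1-9])\d*\b")


def leading_digit_counts(text):
    """Return {digit: count} for digits 1-9 that start a number in text."""
    counts = Counter(int(m.group(1)) for m in LEADING_DIGIT.finditer(text))
    return {d: counts.get(d, 0) for d in range(1, 10)}


def benford_expected(digit):
    """Share of numbers Benford's law predicts will start with digit."""
    return math.log10(1 + 1 / digit)


def compare_with_benford(text):
    """Return rows of (digit, count, observed share, expected share)."""
    counts = leading_digit_counts(text)
    total = sum(counts.values())
    rows = []
    for d in range(1, 10):
        observed = counts[d] / total if total else 0.0
        rows.append((d, counts[d], observed, benford_expected(d)))
    return rows
```

To try it, read a large text file into a string, pass it to `compare_with_benford`, and print the rows. How closely the observed column tracks the expected one depends on the sample, so treat any single file as an illustration rather than a test of the law.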
  • ..until no student ever has to research any topic again?

    Just head over to tellmewhatthisthingyisabout.com > Print
  • That the text has to first contain some knowledge in it to begin with?

    Maybe this is just an attempt at getting a machine to generate core knowledge but then haven't they been working on common sense, which is sorta needed first?
  • "to extract some sort of refined knowledge from it." hum....
    If you have an infinite number of rednecks .... an infinite number of shotguns & shotgun shells .... and an infinite number of stop signs, you will eventually get Shakespeare in Braille.....
  • by Strange Ranger (454494) on Friday October 17 2003, @03:57PM (#7243964)
    ...skimming large volumes of miscellaneous text to extract some sort of refined knowledge from it.

    Dear Text Miners,

    Please start here: http://slashdot.org [slashdot.org]

    Thanks so much.
  • Well, DUH! (Score:3, Insightful)

    by djeaux (620938) on Friday October 17 2003, @04:13PM (#7244119) Homepage Journal
    How well computers truly make sense of what they are reading is, of course, highly questionable, and most of those who use text-mining software say that it works best when guided by smart people with knowledge of the particular subject.

    May I offer that computers make no sense of what they are reading & that "smart people with knowledge of the particular subject" aren't optional if the results of text-mining are to be of any usefulness whatsoever, at least in any kind of reasonable time frame.

    Otherwise, the text-mining computer is playing the old "99 monkeys with typewriters" game...

  • by koekepeer (197127) on Friday October 17 2003, @04:35PM (#7244331)
    i always wondered about this

    all right, you can take huge amounts of text and apply some smart tricks to extract patterns from it.

    but how can you determine whether the original data was trustworthy?

    take the example of genome annotation (description of gene function), which would be helped greatly by including more functional descriptions from scientific literature. how do you determine whether the original publication was backed by solid experimental research?

    by the reviewers of the articles? i don't think so, peer review is a snakepit filled with politics. by the number of people who cited it? hmmmm... so hip subjects are more true?

    me personally, because i'm experienced, can recognise bullshit articles when i see them. but how to translate this into an algorithm... anyone any ideas about this? or even working solutions?

    (of course this is an example from my field of expertise - biology, but it applies to any set of text data/articles IMO)
    • Hmm, I guess you cannot say that for sure, but most systems today use trust metrics.

      For example, an ACM/IEEE source would have a much higher trust metric than say, from some local conference in Egypt (no offence to any local conferences in Egypt, but you get the wind :)
      • i see the point, but is this truly representative of reliability?

        you rely on peer review, on citation indices, so mostly IM-not-so-HO on matters of politics.

        when you scan abstracts yourself, you can dig into the detail when something looks interesting enough, but the decision making process that drives me while scanning abstracts is not much influenced by the fact whether it is in a high impact journal (or any other high impact publishing body) or in something mostly not noteworthy.

        to put it in another
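The trust-metric idea raised in this thread can be sketched as follows. This is a minimal illustration and not how any particular system works: the venue labels, the weights, the `Finding` type and the function names are all hypothetical, and the relevance score is assumed to come from some upstream text-mining step.

```python
from dataclasses import dataclass

# Hypothetical per-venue trust weights; the numbers are invented
# purely for illustration.
SOURCE_TRUST = {
    "major-society-journal": 0.9,
    "local-conference": 0.4,
}
DEFAULT_TRUST = 0.2  # assumed weight for venues missing from the table


@dataclass
class Finding:
    text: str         # the extracted statement
    source: str       # venue label, looked up in SOURCE_TRUST
    relevance: float  # 0..1, assumed output of the text-mining step


def trust_weighted_score(finding):
    """Scale relevance by the trust weight of the publishing venue."""
    return finding.relevance * SOURCE_TRUST.get(finding.source, DEFAULT_TRUST)


def rank_findings(findings):
    """Order findings from highest to lowest trust-weighted score."""
    return sorted(findings, key=trust_weighted_score, reverse=True)
```

Note that the weight attaches to the venue and not to the individual paper, which is exactly the objection made in the reply: a weak result in a well-regarded venue still outranks a solid one published somewhere obscure.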
  • (1) "Of course, no one, Dr. Liebman included, is arguing that these products are actually reading anything. What they are engaged in is 'text mining,'"

    Dijkstra once said "The question of whether computers can think is like the question of whether submarines can swim." ... Just a thought.

    (2)As noted in the article sarcasm is very hard to detect. If you think about it even many people have a hard time recognizing it. How are we supposed to develop an intelligent system when we "intelligent" humans don'

  • They just had to get it in somehow:

    like the 858-page report on the congressional inquiry into intelligence failures regarding the Sept. 11, 2001, terrorist attacks.
  • NYT won't be contributing to this large body of text, because registration is STILL required.
  • KDD Cup (Score:4, Informative)

    by apsmith (17989) * on Friday October 17 2003, @11:18PM (#7246610) Homepage
    The knowledge discovery and datamining cup challenge this year [cornell.edu] was looking at the arxiv.org [arxiv.org] papers for this sort of analysis - some very interesting results. The Task 4 winner [umass.edu] looked at the structure of the papers as a sort of relational database and uncovered a lot of statistical patterns and metrics that could be quite useful for scientists.
    • This is OT, but read this journal entry [slashdot.org] from CmdrTaco.
    • yes, I have noticed the metamod thing too..

      in addition there is also this tasteless group of guys who keep making posts about greased-up yoda dolls which has also forced me to start browsing at +2..

      seems that mod points are being handed out with less frequency than they were before.

      I think they should start handing them out for people with "excellent karma" and then track if the metamods agree with the point distribution..

      that is just me..

    • Answers to your questions: HERE [slashdot.org]
      • I agree about the quality of recent stories. I used to look forward to refreshing the main page all day and seeing an interesting story pop up every hour or so, one that generates a couple of hundred comments and several deeply-nested threads.

        Now I refresh and see a review of a pirate book with ~70 "+2" comments and "Third Anniversary of Bezos-Backed Patent Reform," which went completely ignored. Meh.

        Of course, I'm not helping by posting near-useless comments like this...
        • And naturally, the few mods that are around have basically wasted a dozen or so points by methodically modding this entire thread Off-Topic. Not that I mind, but for Pete's sake, there's not a single +5 Mod in this topic yet! There are only 4 "+3" posts! At least try to be constructive, for crying out loud! =P
    • .. I meant "statistical approaches", not "statiscal approaches" ..

      (I was trying to type while holding my wife's baby parrot, and he sometimes goes nuclear if you don't pay enough attention to him :-)

      BTW, pardon the shameless plug, but I added a short chapter on statistical NLP (with an example program simple enough to understand easily) to my free Java/AI web book.

      -Mark