Mining Unstructured Data

jscribner writes "Data these days tends toward an unstructured form, be it text (like the web, email, or books), the spoken word, or even DBs with unique organization (and thus a discrete language). There's a new article on Unstructured Data in Think Research; it's an overview of the challenges, progress, and potential rewards in this area. I'm leaving it on your doorstep because, to me, it's a good launching point for discussion of several interesting possibilities: /. as a minable DB of ideas, email identified by interpretation rather than keywords, emotive XML, etc."
  • They'd better get ready to pay some Google patent-licensing fees:
    People also make their feelings known in less direct ways, says Jhingran. "People actually vote their preferences by providing links to different documents," he explains. "You may be able to determine that a page is authoritative because lots of people have found it important enough to have links to it. People explicitly create links from page one to page two, and if many people point to page two it looks like it is an important link to something." Businesses could use such analytical capability to determine the "buzz" about their products found in chat rooms and forums on the Internet.
    • The scientific community was fascinated with the topology of indexes long before Google. I can remember in college (1980s) coming across companies that tallied the citations in all academic journals and produced various reports on the influence of various writers, and on trends. (Of course, I went senile and can't remember the names of the organizations.) In any case, the bibliography of an article is often as interesting as its contents. If I ever had any spare time and cash, my plan was to turn y-intercept.com [y-intercept.com] into a place where people could track the citations in different books. Google didn't come up with a new idea; they are just applying an old idea to the web.
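
      A back-of-the-envelope sketch of the link-voting idea, in Python: count links as votes, then refine with a generic PageRank-style iteration so that votes from authoritative pages weigh more. The link graph is invented, and this is the textbook recurrence, not anyone's production algorithm.

      links = {
          "page1": ["page2", "page3"],
          "page2": ["page3"],
          "page3": ["page1"],
          "page4": ["page3"],
      }
      pages = set(links) | {p for targets in links.values() for p in targets}
      rank = {p: 1.0 / len(pages) for p in pages}
      damping = 0.85
      for _ in range(50):  # iterate until the scores settle
          new = {p: (1 - damping) / len(pages) for p in pages}
          for src, targets in links.items():
              share = damping * rank[src] / len(targets)
              for t in targets:
                  new[t] += share
          rank = new
      # page3 collects the most in-links, so it comes out most "authoritative"
      for page, score in sorted(rank.items(), key=lambda kv: -kv[1]):
          print(page, round(score, 3))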
  • answer is perl (Score:1, Interesting)

    by Anonymous Coward

    http://www.google.com/search?q=learn+perl+for+hu ma nities+student+data+mining

    (remove the silly space in "humanities")

    perl and a lot of thinking, that is.
  • by Anonymous Coward
    ...it basically converts human-readable, unstructured data into computer-readable, structured data. Parsers are in the works for converting unstructured services into standard services; for example, the inboxes of Yahoo, Lycos, Mailcity, Excite, etc. are converted to an internal form, which is later served via a POP3 server.

    Of course, this isn't limited to web-based e-mail, there are parsers to parse web-based forums and bulletin boards, yes even for Slashdot. The unstructured data here can be converted and served via NNTP (NetNews) or some other method.

    There's a huge amount of unstructured data and services available on the Web; making these available to computers is a huge step forward in information technology.
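
    A minimal sketch of the parsing half of such a gateway, in Python with the third-party BeautifulSoup library. The inbox markup here is hypothetical; a real gateway needs one parser per webmail provider, and it breaks whenever the provider redesigns its pages.

    from bs4 import BeautifulSoup

    INBOX_HTML = """
    <table class="inbox">
      <tr><td>alice@example.com</td><td>Meeting notes</td><td>Mar 15</td></tr>
      <tr><td>bob@example.com</td><td>Re: data mining</td><td>Mar 14</td></tr>
    </table>
    """

    def parse_inbox(html):
        """Turn one provider's inbox page into structured message records."""
        soup = BeautifulSoup(html, "html.parser")
        messages = []
        for row in soup.select("table.inbox tr"):
            sender, subject, date = [td.get_text(strip=True)
                                     for td in row.find_all("td")]
            messages.append({"from": sender, "subject": subject, "date": date})
        return messages

    # A POP3 or NNTP front end would then serve these records to mail clients.
    for msg in parse_inbox(INBOX_HTML):
        print(msg)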

  • Oh Man (Score:1, Insightful)

    by Tadrith ( 557354 )
    This is a fascinating article. I'm especially interested in it, because I tend to work with databases - most of which are created from completely unstructured data.

    For instance, Company A is tracking all their data in a Microsoft Word document. I frequently get asked to work with this data dynamically and pull it out programmatically. I can attest to how difficult that can be, and I often find that upper management doesn't understand the challenges of pulling unstructured data out.

    I definitely recommend this article - especially when trying to explain to your boss why you can't flick your magic wand and *poof* the data moves from his text file into a database. (A sketch of what that extraction actually involves follows this thread.)
    • I think upper management (and the general public) think that a computer is some sort of magic box. "The data is right there! Why can't you just take it from the Word document and put it in the database? I can understand it, so why can't the computer?" But people have been working on automatic language understanding for over 50 years and haven't even come close to solving the problem. I work in natural language processing, and I can attest, it's tough.
    • I definitely recommend this article - especially when trying to explain to your boss why you can't flick your magic wand and *poof* the data moves from his text file into a database.

      And pray your boss hasn't heard of Perl :)
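
    A rough sketch of the extraction problem described above, in Python with the third-party python-docx library (which reads .docx only). The file name and the field pattern are invented; guessing a pattern that fits the document is exactly the hard, unstructured part.

    import re
    from docx import Document

    doc = Document("company_a_tracking.docx")  # hypothetical file name
    records = []
    for para in doc.paragraphs:
        text = para.text.strip()
        # Hopeful guess at structure: lines like "Customer: Acme Corp" look like fields.
        match = re.match(r"(\w[\w ]*):\s*(.+)", text)
        if match:
            records.append((match.group(1), match.group(2)))
    print(records)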

  • by bravehamster ( 44836 ) on Friday March 15, 2002 @07:28PM (#3170971) Homepage Journal
    email identified by interpretation rather than keywords


    A machine will be considered truly intelligent when it can translate all emails on Slashdot into a usable form. Since spammers are some of the most persistent and aggressive users and developers of technology, I expect we'll have real AI telling us how to enlarge our penises by next Thursday.

    • if the machine can hack into the /. database servers and get all of our real email addresses, then I will consider it really smart (unless /. is running IIS/MSSQL these days; then it wouldn't require much intelligence)
  • by theonomist ( 442009 ) on Friday March 15, 2002 @07:29PM (#3170975) Homepage

    Oooookay.

    Sir? Please step away from the bong.

    I just spent an enjoyable half hour or so reading Business 2.0's "minable database" of 101 Dumbest Moments in Business, and then I had a look at their even-more-hilarious 100 Dumbest Moments in e-Business [business2.com]. This article really does have that weird flavor of megalomaniacal Internet-hype gibberish that we all came to know so well during the boom years. In a way, it's a pleasant little nostalgia trip to see the same old idiocy presented with the same old mindless confidence, but in another way it's just depressing.

    Reality Check: Slashdot is a BBS for bored IT workers taking a break while installing nine hundred copies of Word on nine hundred 266 MHz beige boxes at the local credit union. It is not a minable database of ideas (or at least not of ideas worth mining). At its best, it's an undergraduate bull session.

    What the hell are you people smoking?

    • You bastard! Can you imagine what it's going to do to the miner's self-esteem, when it reads your message?

    • what that list doesn't tell you is that all those stupid stock market analysts were doubling their money every month or less, because the rest of us suckers bought their hype and false predictions. PaineWebber, Merrill Lynch, Credit Suisse, Oppenheimer & Co. consolidated the largest percentage of the world's money since the London Bay Company & East India Company. And they haven't lost a penny of it. (WTC offices were insured)
    • It depends on the article ... when was the last time you read every article in a given week and every message attached thereto?

      _Sometimes_ unique or semi-unique or thought-provoking ideas get stated. That's the nature of chat rooms and discussion boards. USENET has unique ideas as well and is much more spam-filled and useless looking on first glance.
    • While I would say that the vast majority of posts on /. are mere discussion, etc - there is a small but useful subset buried deep within that arguably contains useful information, or at the very least would serve as a starting point for further research.

      There are a TON of "Ask Slashdot" articles with very valuable information. I also see in this article many valuable posts (especially that one with the tons of links on machine learning and mining). I also remember some valid ideas bandied about back on the homebrew rollercoaster posting. I also remember seeing information on a posting about Mozilla yesterday talking about how to get nice-looking fonts in X. Finally, I remember quite some time back (possibly up to 2 years ago) an article on AI in which two individuals, who seemed to know their shit at minimum, and at best were both neuroscientists, argued about how neurons work and how the brain "thinks" - most of that went WAAAY over my head, but it was valuable information (or at least it might be a stepping stone).

      I see this all the time here on /. - true, there is a ton of SPAM and troll posts, etc to wade through, but that is what we are discussing here - how do you "mine" through the ore to get to that nugget of "gold"?
  • Slashdot (Score:4, Funny)

    by rbgaynor ( 537968 ) on Friday March 15, 2002 @07:30PM (#3170989)

    Interesting, my mining of hot ideas on Slashdot has determined that a Beowulf cluster of First Posts is the next big thing...

  • by MosesJones ( 55544 ) on Friday March 15, 2002 @07:31PM (#3170992) Homepage
    By definition this is an unsolvable problem, because what it requires is a definition of undefinition (if such a term exists). While you can make assumptions about unstructured data and apply natural-language rules across it, you are still left with the possibility that you've interpreted incorrectly. So to create definition in a loose format inherently requires you to assume its meaning; the rate of accuracy can be improved, but absolutes are impossible to attain.

    Simply put, if you don't understand what someone is talking about, you can make a reasonable guess and then refine it but you are always making assumptions.

    To put it in simple terms for George W. Bush:

    All Muslims are Terrorists
    All Supporters of Militia (McVeigh) are terrorists

    Unstructured data is a great way to make money, and a great way to get 80% of the story, the trouble is the other 20% gets destroyed in the process.

    Welcome to 1984, and a Brave New World, the minority will cease to count.
    • By definition this is an unsolvable problem, because what it requires is a definition of undefinition (if such a term exists). While you can make assumptions about unstructured data and apply natural-language rules across it, you are still left with the possibility that you've interpreted incorrectly. So to create definition in a loose format inherently requires you to assume its meaning; the rate of accuracy can be improved, but absolutes are impossible to attain.

      Probably it should be called randomly structured data, but in any case the problem still boils down to what you described: deciding what is relevant, and how. Otherwise you just have a bunch of blobs.

      Or else you have a database with links to the random objects (Word docs, etc.), but with descriptions, etc. about the objects in the database. Quick and dirty, but not the best solution.

      • Or else you have a database with links to the random objects (Word docs, etc.), but with descriptions, etc. about the objects in the database. Quick and dirty, but not the best solution.

        which, come to think of it, is what happens in XML anyhow: you are adding tags in the file instead of keeping descriptive data outside in the database.

        YMMV as far as which method will work better for you.

      • Re: (Score:2, Interesting)

        Comment removed based on user account deletion
        • As I mentioned elsewhere in this discussion, the problem with XML is that it must be fully nested. This is, for many types of unstructured data, a horrible situation. The problem is that when mining for data you often don't have the structure but are creating the structure. This relates various contexts in ways that don't fit the requirements of an XML topology. An example of this is relating pages to paragraphs. Paragraphs aren't always nested within pages. One structure can cross the borders of the other structure.

          However, once you have some structures (say, basic linguistic units like sentences, words, paragraphs, pages, speakers, etc.) you can then create other ones. From those structures you can then use various techniques to develop more information.

          Once again, great in theory, complex in practice. However, many of the techniques used in NLP to understand words can then be extended to larger units of meaning. Further, you can then start to relate various types of contexts. Of course, how helpful all this is depends on the type of analysis you are making. Some practical problems are very solvable now. Other problems are more complex.

          But consider some future "Google" which indexes pages based not on words but on concept spaces. It then uses other methods, such as the links to a page and so forth, to rank not just pages but concept *spaces* within a page. Finding information would become much, much easier.

    • Unstructured data is a great way to make money, and a great way to get 80% of the story, the trouble is the other 20% gets destroyed in the process.


      Nothing is destroyed. The original data is still there. These technologies are best used to summarize and search large information archives (e.g. the web). Does Google destroy any data? No, it merely indexes it in a certain way. In fact, search services often make data more easily accessible - the opposite of what you are arguing.
      • No, what he's arguing is that the context is effectively destroyed. The methods are good enough to get 80% of the meaning, but the other 20% is lost if all you do is thumb through the search results. The only way to restore that other 20% is to read the actual documents themselves, using human reason and judgement to come to logical conclusions (assuming the reader is capable of such a thing in the first place).

        Max
    • The other 20% gets lost anyway. Who really reads an entire discussion on /. carefully? The hope for parsing unstructured data is that redundancies can be aggregated, reducing the amount of time needed to consume the full range of ideas in a given set of documents...
  • Good use of XML (Score:2, Informative)

    by soap.xml ( 469053 )

    From the article:

    One tool used to corral unstructured data is XML (extensible markup language), which tags salient parts of unstructured electronic documents so they can be searched. The structure of XML documents resembles that of a tree, with branches of tagged information, while relational databases consist of regimented rows. "Being able to produce, accept, store, and search XML provides a little structure to unstructured information," explains Selinger of the Silicon Valley Lab.

    This makes a lot of sense. When you think about it, things like images and audio clips can provide some very useful information, but they can be difficult to classify and store in a useful and searchable manner. Having a product or suite of products that could not only classify, but also search, the many different types of XML signatures for each type of resource could prove to be a very valuable thing for businesses.

    Imagine the amount of time that could be saved if you could simply search all of those images and diagrams that you have for different projects, and all of the audio clips from the conferences that you have attended, for that key idea that you're sure is in there but just can't remember where!

    -ryan
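
    A tiny sketch of that kind of XML catalog and search, using the Python standard library. The tag names are invented for illustration, not any real standard.

    import xml.etree.ElementTree as ET

    CATALOG = """
    <assets>
      <asset type="image" file="widget_diagram.png">
        <project>widget-redesign</project>
        <keywords>gearbox schematic tolerances</keywords>
      </asset>
      <asset type="audio" file="conf_keynote.mp3">
        <project>annual-conference</project>
        <keywords>keynote data mining roadmap</keywords>
      </asset>
    </assets>
    """

    root = ET.fromstring(CATALOG)
    query = "data mining"
    # The binary assets stay opaque; only their XML wrappers are searched.
    for asset in root.findall("asset"):
        if query in asset.findtext("keywords", default=""):
            print(asset.get("file"), "-", asset.findtext("project"))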
    • Interestingly enough, relational database technology itself was created to overcome the limitations of hierarchical databases (aka tree-based data structures). The problem back then was that not everything can be organized into a nice little hierarchical tree of data; however, if you could create relations between otherwise unrelated pieces of information, you could tie together all sorts of disparate data.

      Seems to me like we're coming full circle with OOP and XML - trying to create huge monolithic structures that can handle everything we need to do. Look at Java - everything ultimately is inherited from the almighty Object. XML is no better in this regard, although you can have lots of different XML files describing different pieces of data.

      I wonder if the people pushing hierarchical (OOP and XML) data models over relational ones realize that the exact opposite was the case 30 years ago. Perhaps we should have stuck with hierarchical databases in the first place?
      • It's really a case of using the right tool for the right job. After all, some data is not well expressed in a tree, while some is not well expressed in a relational database. Does this mean it's more right to use one or the other? Too often I see people using XML just because it's new, and not because it actually makes the data easier to work with.

        As for the Object hierarchy in Java, it really doesn't limit what you can do with the objects and classes... you can still have a class with no data and only static methods, which is just like a function in C. The nice thing about the automatic Object superclass is that it makes generic, heterogeneous containers really easy to use.
      • Interestingly enough, relational database technology itself was created to overcome the limitations of hierarchical databases (aka tree-based data structures). [...] Look at Java - everything ultimately is inherited from the almighty Object.

        Don't mistake a hierarchical type structure for a hierarchical data structure.

        In Java, one might model things so that Persons and Vehicles are both subclasses of Object, and that Cars and Trucks are subclasses of vehicles. This is indeed strictly hierarchical.

        But a Person called Joe can be the owner for a Truck, ride in a Car, and be the spouse of another person Jane simultaneously. That's not a hierarchical relationship; it's a web of connections.

        You can still have hierarchical relationships with OO data; if Joe sells his truck, the Engine and the four Wheels would automatically go along with it. But that's just one possible relationship.
    • Actually, depending upon the kind of data you are mining, XML is very poor for this. Consider a simple structure that exists in every book. You have pages, paragraphs, authors, quotes, and so forth. The problem is that different blocks are not always within other blocks (i.e. nested, the way inner loops are always nested in a programming language). Instead, a paragraph block can be half in one page block and half in another.

      That doesn't sound like a big problem, but it can be when you are using regions to map out new concepts (i.e. analyze a class of words in all sentences that contain the concept of Apple Computer). In practice, writing "concepts" to analyze (data mine) texts of this sort is very hard. Further, using tools like Perl for this can be a pain. Yeah, you can do it, but you probably won't do it well.

      I know that the company Sageware [sageware.com] which I have dealt with does what this article describes. However it supplies various "objects" for mining for concepts. It ends up being tricky stuff which is why mainly large portals use the technology.

      The basic notions can apply to Perl or simple C code. Go very complex though and things get messy very quickly.
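
      A sketch of the usual workaround, stand-off annotation, in Python: leave the text alone and store each structure as a character span, so regions can overlap freely in ways nested XML tags cannot express. The offsets below are illustrative.

      text = "First page text continues into a paragraph that crosses the break."

      regions = [
          ("page",      0, 35),          # spans are (kind, start, end)
          ("page",      35, len(text)),
          ("paragraph", 16, 52),         # straddles the page boundary
      ]

      def regions_overlapping(kind, start, end):
          """All regions of a given kind that intersect [start, end)."""
          return [(k, s, e) for (k, s, e) in regions
                  if k == kind and s < end and e > start]

      # Which pages does the straddling paragraph touch? Both of them.
      print(regions_overlapping("page", 16, 52))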

  • This sounds interesting - particularly because it essentially makes an unstructured filing system suddenly become a structured one. What implications does this have for UK law?
    UK data protection states that copies of email have to be kept for a minimum period of 28 days. It advises that "email is a transitory medium," and the person at our company in charge of such policies has just written a policy saying I'm supposed to program our mail systems to auto-delete mail after a three-month period. Staff are supposed to save the emails they want to keep to their local hard drives, as they then become "documents" rather than emails.
    Why? Because in the UK any individual can legally ask for copies of any email that mentions them by name. Local hard drives can be searched, but only if the documents are stored in a "structured filing system". I have raised concerns about what constitutes a "structured filing system," to the point where I would argue that FAT, NTFS and HFS are structured, since they utilise indexes. Add to this the new MS Object Oriented Filing System (OFS), which is basically going to be a simplified version of SQL Server as a filing system: is the ability to search previously "unstructured" data going to complicate UK law?
  • by johncheng ( 308479 ) on Friday March 15, 2002 @07:40PM (#3171033)
    This article will be of great importance to our director of IT, since the way our company stores data seems to be completely unstructured.
  • Why is it that the very thought of mining Slashdot makes me think of the goatse.cx guy?

    Hell, I've really got to stop reading at -1 so much.

    --saint
  • Google Made to Order (Score:3, Informative)

    by shalunov ( 149369 ) on Friday March 15, 2002 @07:41PM (#3171040) Homepage
    Some quotes from the press release:
    People actually vote their preferences by providing links to different documents. You may be able to determine that a page is authoritative because lots of people have found it important enough to have links to it. People explicitly create links from page one to page two, and if many people point to page two it looks like it is an important link to something.
    This Discoverylink(TM) search engine concept somehow sounds very familiar. Where could I have heard this innovative idea before? Or, as the press release asks, "Where did I read that?" Ah, yes! [google.com]
    • "Where did I read that?" Ah, yes! [google.com]

      Of course if you were in IBM Research, as the authors are, you might have been familiar with The Clever Project [ibm.com] prior to Google. It is explained very nicely here [sciam.com].

      I am not saying that the authors might not have been inspired by Google, but I am saying that Google isn't the only possible source of their inspiration.

      • I am not saying that the authors might not have been inspired by Google, but I am saying that Google isn't the only possible source of their inspiration.
        The concept of using links as votes to rate resources is simply not novel anymore. Everyone knows about it. I'm not claiming that Google invented it (not being a very non-obvious idea, this was probably independently developed at a number of places); but presenting stuff familiar to everybody as "Invented Here" news sounds like PR.

        But wait, it was a press release. Submitted by someone from IBM, too.

  • by candot ( 513284 ) on Friday March 15, 2002 @07:41PM (#3171041)
    They draw you in with the bit about unstructured data, but it turns out to be more about differently structured data. I think they missed their own point.

    I just attended the Knowledge Technologies [knowledget...logies.net] conference in Seattle. It's scary how many people think the way to mine unstructured data is to force it into a structure. So many people spending years developing standard taxonomies--different standards, of course. And so many companies (like Semio [semio.com], for example) that want you to develop your own taxonomy. Then you wind up with the very problem this article really discusses.

    [Skip next section to avoid my self-promotion]

    I'm a big fan of mining unstructured (and differently structured) data by throwing a mining layer on top of it. All of us at Think Tank 23 [thinktank23.com] are. Check out the demo of our technology, Waypoint 2.0 [thinktank23.com], which pulls concepts from unstructured documents, then uses the concepts as the basis for finding relationships between them.
  • by Anonymous Coward on Friday March 15, 2002 @07:45PM (#3171055)
    "email identified by interpretation rather than keywords"

    Report: The attached email messages indicate a successful business plan. This simple way to make money fast by selling pamphlets is interpreted as being good: it has been confirmed by many quotes within the email, by repetition in many similar emails, by the suggested calculation of potential return.

    Opportunity: There is an unfilled business opportunity which is confirmed by the lack of existing businesses which use this plan. Searches of local and national databases have not found any businesses which are using this method.

    Suggestion: Give me a dollar so I can start a business.
  • These guys [visiblemind.com] have some interesting tech revolving around semantic search... worth a boo anyway...
  • We used Perl regular expressions and lex/yacc-like tools to tease structured data out of semi-structured web pages and other listings. It's doable if you limit your scope to one particular subject, such as job listings. The hardest part is creating contextual lexicons. Does MS mean that a master's degree is required? That the job is located in Mississippi? That experience with Microsoft products is required? That the HR contact is Ms. Smith? You have to figure it out based on context: is MS preceded by a city name, that type of thing.
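
    A toy version of that contextual lexicon, in Python. The patterns and categories are illustrative guesses, nowhere near production coverage.

    import re

    # Ordered rules: the first pattern whose context matches wins.
    RULES = [
        (r"\bMS\b(?=,?\s+(Word|Office|Windows|SQL))", "Microsoft"),
        (r"\b(Jackson|Biloxi|Gulfport),\s*MS\b",      "Mississippi"),
        (r"\bMS\b(?=\s+(degree|in\s+\w+)\b)",         "master's degree"),
        (r"\bMs\.\s+[A-Z]\w+",                        "honorific"),
    ]

    def classify_ms(snippet):
        for pattern, meaning in RULES:
            if re.search(pattern, snippet):
                return meaning
        return "unknown"

    for s in ["Experience with MS Word required",
              "Position located in Jackson, MS",
              "MS degree in CS preferred",
              "Contact Ms. Smith in HR"]:
        print(s, "->", classify_ms(s))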
  • A minor nitpick with the article... when the term "natural language understanding" is used, it seems to be mostly synonymous with "speech recognition". Actually, speech recognition is a subset of natural language understanding. NLU (or NLP, natural language processing) deals with all aspects of understanding human languages. In fact, most NLP is done with text, not speech.
  • by waimate ( 147056 )
    Of all the information stored in computers, 80% of it is unstructured, and arguably it's the most valuable 80%, too.

    Think of the informal knowledge embodied in the emails sent and received, attachments, spreadsheets, favorite websites, and your colleagues' documents, as well as SQL databases and the like. There simply is no suitably shaped container that you can put amorphous knowledge into. It defies structure, and XML is no answer.

    Useful knowledge is of a pervasive nature. It infuses through everything, and often the really useful bits are where you least expect them, so attempting to design a structure a priori to hold it is always doomed to failure.

    The key here is polymorphic searching of both structured and unstructured data without distinction. That's where products such as ISYS [isysusa.com] earn their salt. The hard part is in convincing the blissfully unaware that knowledge is being wasted in the first place.

    The other key concept is value. Large result lists are less useful than small, high-quality result lists. Everybody knows this from using Google and getting back 198,000 hits. In the old CB radio days, it was called a squelch knob. Search engines that just give you large amounts of static do you a disservice. Useful results are small and targeted.

    • The irony, of course, is that ALL data we receive and send for human consumption is INDEED structured.

      It is just not structured according to how the COMPUTER sees it.

      Hell, this posting of mine right here is structured, and beyond the obvious sentences/paragraphs explanation that is most often given.

      Almost all written work is designed so as to allow the reader to follow along with the author's thought process.

      Indeed, writing could be looked at as some sort of bare-level, one-shot emulation code for the human brain.

      Now for computers this makes NO sense at all.

      Uh duh, they don't think.

      Now, with a lot of work, natural languages can indeed be PARTIALLY understood by computers, and there is an artificial language out there (I forget the name) that was designed from the ground up for both computer and human understanding on a quasi-equal level. But even so it cannot match the same. . . . underlying meanings between both parties.

      Humans are capable of understanding all of the complexities of modern-day computers; it may require a lot of work and some darn good wizardry, but it IS possible.

      The issue is that the way computers 'think' is nothing but a subset of our own thought methods, one that we have expanded upon and made more complex but ultimately added nothing new to.

      And yet it is by the very nature of being a subset that computer 'thinking' (ugh, I hate using that term in this context) can only contain a partial set of the abilities of human thinking.

      Ah, to take a related explanation from Dansdata [dansdata.com]

      "But a clever enough algorithmic composition system can get around this, by using a human to direct it through infinite musical space. With any luck, the human will have some idea of what sounds good; that's a really difficult thing to teach a computer."

      (speaking about the Korg Karma's composition functions)

      Humans have to GUIDE the computer.

      For instance the file finder feature on many OSs.

      If I tell my Windows box to search for mIRC* it will search my entire computer's hard drive including my Cygwin folder and my C:\corel folder.

      Which is obviously highly friggin stupid since mIRC is NOT going to be in either one of those. (well not today at least. :) )

      But the COMPUTER does not know that. Despite my having a highly refined layout system for my files, with everything compacted into nice small little subsets of subsets according to what type of file it is, the damn computer has:

      No idea WTF mIRC is, or what IRC is (outside of some sort of program that tells the computer to interpret network packet X with evaluation system Y and display Z depending on X's contents, and oh yeah, shove the word IRC on the window while you're at it. That is ALL computers know of IRC), or what the hell a 'program file' is, or why in the world (no concept of 'why' either) mIRC would be in C:\program files\

      Now if I use a bit of human judgement and direct the computer to search only C:\program files\ it can find the requested files just fine.

      But it is STUPID. Period.

      What is the BEST possible outcome we can hope for in this situation? Hmm?

      Hah. All files in some sort of a database system? Make it 'object based'? Or just add assloads of data to the 'file fork'.

      Bah, it would STILL come down to the computer going over each friggin entry in a database until it gets a match with the search string. Hell, even if some more efficient searching algorithm is used besides just going through every item in the database, the fact is that the computer

      (pay attention here folks)

      STILL HAS NO FRIGGIN IDEA AS TO WHAT IN THE HELL mIRC is.

      I can add descriptors to heck to all files associated with the program. And the computer will STILL NOT KNOW WHAT mIRC IS!

      Once again.

      THE COMPUTER HAS NO IDEA AS TO WHAT THE HELL ANYTHING IS.

      For instance.

      I know offhand that my copy of VirtualDub is in F:\video editing tools\virtual dub\ (actually the version number follows it, but close enough. :) )

      Now, the computer has no idea what 'video editing tools' is. (I am using "is" here, folks. Plural? Huh, what's that? What is 'what'? The computer does not have an understanding of ANY of these topics.)

      In fact, one thing that SO many people seem to forget, is that COMPUTERS UNDERSTAND NOTHING.

      Nothing AT ALL.

      PERIOD.

      So please.

      Please.

      PLEASE

      Understand that the computer will NEVER be able to truly organize or structure your data, because the computer does not even know what the hell a structure is. Sure you can tell it to shove such and such bits into such and such places, but it knows not what those bits are or what those bits mean or what those places mean or what the hell a place is or ANYTHING ELSE AT ALL.

      I can make my computer feel happy.

      I have it show "I am happy" on the screen.

      That is as close as you are ever going to get the current breeds of computers to being able to understand or think about anything at all.

      Because everything eventually comes down to that same basic fact.

      The computer does what you tell it to, and nothing else.

  • Isn't XML only part of the solution? I'm pretty sure RDF [w3.org] comes into play somewhere here.

  • XML won't make it (Score:4, Insightful)

    by mangu ( 126918 ) on Friday March 15, 2002 @08:16PM (#3171183)
    To encode information in XML is as much work as doing it in SQL or any other language. What is needed is artificial intelligence: take any data source, be it a picture, text, music, or whatever, and classify it. Some examples of what I have wished for:

    - show a text and find other texts about the same subject.

    - hum a tune and find an mp3 of the same music.

    - show a picture and find other pictures of the same girl.

    - better yet, show a picture of a girl's face and tell your search engine to find nude pictures of the same girl...

    Until those simple tasks can be done easily, we will be stuck with the 13,500 links one gets when searching for "christina ricci nude" in Google.
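
    The first wish is the most tractable: "texts about the same subject" can be approximated with bag-of-words statistics. A minimal sketch using the third-party scikit-learn library; the documents are stand-ins.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    docs = [
        "Mining unstructured text with natural language processing",
        "Relational databases store rows in tables",
        "Statistical NLP techniques for text mining and search",
    ]
    query = "data mining of text documents"

    vec = TfidfVectorizer()
    matrix = vec.fit_transform(docs + [query])       # last row is the query
    scores = cosine_similarity(matrix[len(docs)], matrix[:len(docs)]).ravel()
    for doc, score in sorted(zip(docs, scores), key=lambda p: -p[1]):
        print(round(score, 2), doc)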

  • creative uses (Score:3, Informative)

    by rnd() ( 118781 ) on Friday March 15, 2002 @08:17PM (#3171185) Homepage
    There are some companies that are doing some creative things [redmind.com] with this kind of technology. [redmind.com]

    It makes you wonder how much of this is based on theoretical linguistics [stanford.edu] and formal semantics [mit.edu], and how much is based on good old fashioned statistics [nec.com] and optimization.

    • It makes you wonder how much of this is based on theoretical linguistics [stanford.edu] and formal semantics [mit.edu], and how much is based on good old fashioned statistics [nec.com] and optimization.

      I can't speak to the work discussed in the original post, but I do know that in the real world a formal linguistics/semantics approach is impractical. These systems require complete or near-complete knowledge structures to work at all. They are brittle, meaning that as the world changes they fail to adapt to the changing lexicon. Formal systems are often computationally expensive, and they scale poorly to large data sets. The practical problems of constructing and maintaining the formal knowledge structures quickly overwhelm the advantages they have over looser approaches.

      So in most cases it is a hybrid of machine learning and statistical techniques that is used in these systems.
      • Often you can mix bits of formal systems with bits of statistical systems. Depending upon what you need, it can get you quite a ways. Of course, formal structure (besides being problematic philosophically) is pretty much beyond anything we could conceive of writing. However, you can do things like write a statistical part-of-speech tagger and then use those structures to find direct objects. Tricks like that are often very helpful in mining data.
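
        A small sketch of that layering with the third-party NLTK library (its tokenizer and tagger models must first be fetched via nltk.download()). The "direct object" heuristic here, the first noun phrase after a verb, is deliberately crude.

        import nltk

        sentence = "The company acquired a small database vendor last year."
        tokens = nltk.word_tokenize(sentence)
        tagged = nltk.pos_tag(tokens)                  # statistical layer

        grammar = "NP: {<DT>?<JJ>*<NN.*>+}"            # hand-written rule layer
        chunks = nltk.RegexpParser(grammar).parse(tagged)

        seen_verb = False
        for child in chunks:
            if isinstance(child, tuple):               # plain (word, tag) token
                seen_verb = seen_verb or child[1].startswith("VB")
            elif child.label() == "NP" and seen_verb:  # first NP after a verb
                print("object-ish NP:", " ".join(w for w, t in child.leaves()))
                break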
  • What happens when you don't know what you're even looking for? Data mining is more about ways to automatically find interesting ways of indexing and displaying data than about simply looking up known values in unstructured data.

    There is a package that is good at displaying unstructured data and letting you see strange patterns, since it has tools for finding patterns in the data. It's called Partek [partek.com].
  • E2? (Score:1, Funny)

    by Anonymous Coward
    Unstructured, like [Everything2]?
  • by rho ( 6063 ) on Friday March 15, 2002 @09:10PM (#3171358) Journal

    This can be handled, and is handled, by metadata. Most OSes do a limp-wristed version of it every day--"that movie I downloaded a few days ago..."

    Natural-language grepping through a binary audio file is, no doubt, quite cool, but I believe it is mostly wasted effort. Well, wasted effort for everybody except IBM, who might sell a few more seats of ViaVoice. I say it's wasted because, most often, it's not the content itself you remember but the circumstances surrounding it. "I saw an article in a magazine, and I read it on the train on the way to Boston - it had something to do with widgets." No amount of data mining will appropriately pull that info out of a simple text file.

    I relate all this in terms of human interaction, i.e. the computer mining to satisfy the needs of a carbon-based lifeform who regularly purchases Big Macs. Data mining between computer programs for other computer programs would be a different kettle of fish altogether - and there are a lot of ex-LISP hackers at MIT who would like to know how you got something like that to work, thank you very much.

    Oh, and Apple called--they'd like their Knowledge Navigator [dcltd.com] back, please.

  • There is a field dealing with this. It's called Knowledge Discovery in Databases (KDD). It's been around for a few years now. Go here [kdnuggets.com] for a slightly more technical overview. The posted article is aimed more toward the business people rather than the technical people.
  • FlipDog [flipdog.com] crawls the web and uses machine-learning technology to extract job listings from companies' web sites. You can then browse through them at their web site, filtering based on location, position category, etc.

    The technology was developed by WhizBang Labs [whizbang.com], and is quite cool. They basically take a small set of job listings that their crawler finds, have a human classify parts of the web page (job title, location, description, etc.) and then let their software program loose on it. It analyzes the human-filtered web pages, "learns" how to extract relevant data, and then uses that to classify all of the other crawled pages. Of course, this is over-simplified, but that's the basic idea.

    (I'm not affiliated with them, other than successfully finding a job there.)

    --Bruce
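
    A very rough sketch of the learn-to-extract idea, using the third-party scikit-learn library. The labeled fragments are invented, and real wrapper induction also exploits page layout and markup, not just the words.

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    # Human-labeled page fragments (the training step described above).
    fragments = ["Senior Software Engineer", "Boston, MA",
                 "Database Administrator", "Remote - Portland, OR",
                 "Maintain production Oracle instances"]
    labels    = ["title", "location", "title", "location", "description"]

    # Character n-grams generalize across wording; a simple classifier
    # learns which kind of fragment it is looking at.
    model = make_pipeline(
        CountVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
        MultinomialNB())
    model.fit(fragments, labels)

    print(model.predict(["Junior Java Developer", "Austin, TX"]))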

    • The problem with this kind of approach is that it doesn't scale well to growing repositories of content where the conceptual span changes over time. For job listings and resumes, this works well because the set of concepts encoded in the content changes very slowly. If I recall correctly, WhizBang Labs recently partnered with LexisNexis to classify legal stuff. It'll probably work there too, as long as they've got a room full of monkeys to keep the training up to date.

      But for dynamic environments - email, Usenet, news/weblog RSS feeds, knowledge bases, etc. - the WhizBang approach, and just about every approach that relies on sample-based training or hand-built taxonomies, just doesn't scale.

      But at least you found a job :)
    • This works because resumes have a structure. The structure varies a fair bit and is somewhat vague in implementation, but it is there. Consider the problem akin to finding word breaks in text if you weren't given such things. Obviously a slightly different problem, but the reason we can solve it is because there is structure to what you are looking for. (I bring it up just because that's the problem I'm working on at work)

      Concepts and so forth are far more unstructured. Consider the problem of finding all references to Apple executives. Now you can get part way there with complex queries. But somehow you have to take some information (say executive names gleaned from connection to terms about executives near terms related to the company name) and then use that info to define spaces in a text or information in text to get you further information. That is a much more complex problem than simply tagging text with XML or so forth. The final output might possibly be taggable. However generating that final output involves many intermediate steps that require complex views of both terms and space.

      You end up requiring a way of querying documents so that you can use complex boolean and ranked queries and complex notions about position and space ranges. Thus you might have a complex boolean query that finds all terms with a certain rank (to do fuzzy match or more complex notions of belonging to a set). Then with those results you create a region and then use those regions for further calculations.

      My caveat for all this is that I did work on a project for Lextek International (Lextek.com) that did all this, so I'm somewhat biased. Probably no one here (given the Open Source nature of things around here) would be a client, so hopefully I can say all this without anyone thinking I'm just tooting my own horn. Besides, I hardly ever see anything on Slashdot I can actually say anything about.

  • This article refers to something called KDD. Knowledge Discovery in Databases [KDD] - like its counterpart Knowledge Discovery in Texts [KDT] - is a whole field of computer science. It's been around for more than 20 years now. The ACM even has a Special Interest Group: ACM-SIGKDD [acm.org].
    For an introduction you should read _Introduction to Machine Learning_ (Kodratoff, Yves; Morgan Kaufmann; 1988), or for a more recent and complete survey, _Advances in Knowledge Discovery and Data Mining_ (Fayyad, Usama; AAAI/MIT Press; 1996).
  • This was my final-year project thesis [f9.co.uk]. Just remember the golden rule: unstructured 2 structured == convert 2 XML. I wrote a [very bad] program in C++/Perl/tcsh (IPC = pipes) to add XML tags to English, and then index the result into a search engine which would use the lingual data stored in the XML tags to help the search.

    NIST does a MASSIVE [nist.gov] competition on this annually. I don't want to be an XML-buzzword whore <Arnold Schwarzenegger accent>(XML commando eats Green Berets, C++, Java, Perl, COBOL for breakfast)</Arnold Schwarzenegger accent>, but you can't beat XML for easily converting anything that you can make sense out of into computer-readable format. Real h3cKoRs use SGML, but us underlings have to stick with things we can understand, like XML. As for expandability, if we want to encode something else into the document, then just tag-it-and-go.

    It took me 200 hours to fish out all these links (before the Google days); I don't want anyone to have to waste as much time as I did feeding the search engines exotic foods. The list is a year old, so pardon the odd broken link. Armed with these, you could probably turn jello into XML ;-)

    My favourite bookmarx

    PROJect[21 links]
    Beginners' Guide[13 links]
    Berkeley Linguistics Dept. Course Summaries, general stuff [berkeley.edu]
    Cryptic IR Vocabulary defined [chungbuk.ac.kr]
    Explanations of weird words like hypernym [uottawa.ca]
    How do we produce and understand speech [auckland.ac.nz]
    How Inverted Files are Created - University of Berkeley [berkeley.edu]
    NLP Univ. of Indiana, very good basics e.g. word sense d [indiana.edu]
    Simple language - useful.... [std.com]
    What is Natural Language Processing, links [onlineinc.com]
    What is POS tagging........ [uiuc.edu]
    Word Sense Disambiguation defined [adobe.com]
    Word Sense Disambiguation in detail, scroll down far [computists.com]
    Word Sense Disambiguator - LOLITA (tested at MUC-7 and SENSEVAL competition as best) [dur.ac.uk]
    XML for the absolute beginner [javaworld.com]

    HTML, XML stuff + parsers[19 links]
    Apache plug-in that uhhh does stuff with XML [caucho.com]
    Convert COM to XML [reactivesoft.com]
    convert XML, HTML to Unix pipeable formats [ofb.net]
    converters to and from HTML [hypernews.org]
    expat XML parser [jclark.com]
    HTML Tidy - converts HTML 2 XML + source code!! [w3.org]
    Parse DB (RDBMS, whatever) to XML [synz.com]
    Perl-XML Module List [perlxml.com]
    PHP Manual XML parser functions - what the hell are they talking about, PHP Virtual M... [easydns.com]
    Public SGML-XML Software [nottingham.ac.uk]
    Pyxie - XML Processor for Python, Perl, etc. [digitome.com]
    SGML+XML tools.org [sgmltools.org]
    The XML Resource Centre - massive number of links [ozemail.com.au]
    W4F wrapper - wrapper converts XML to HTML [upenn.edu]
    XFlat - convert flat file into XML [unidex.com]
    XML Parsers and other XML stuff [sw-technologies.com]
    XML.com - Parsers, etc. [xml.com]
    XML-Data Catalog System - uhhhh looks close [opendialog.com]
    XTAL's general converter - convert anything 2 XML [zeigermann.de]

    other Background[8 links]
    Is Linux ready for the Enterprise, scalable... [winntmag.com]
    Linux reliability [mindcraft.com]
    Linux Versus Windows NT, Mark(sysinternals bloke) [linuxtoday.com]
    PC reliability (pcworld) [pcworld.com]
    SPEC - Standard Performance Evaluation Corp. [specbench.org]
    Systems benchmarks [tbri.com]
    TPC - Transaction Processing Performance Council [tpc.org]
    Unix Beats Back NT In EDA Workstation Arena [techweb.com]
    Proper TREC(-8) QA systems[2 links]

    pg. 387 LIMSI-CNRS pretty deep parsing[2 links]
    More links.... [limsi.fr]
    NLP, IR links - lots to corpii, etc. [limsi.fr]

    pg. 575 U. of Ottawa and NRL (shit system, got 0%)[1 link]
    LAKE Lab [uottawa.ca]
    pg. 607! University of Sheffield (crap system, but OPEN SOURCE!)[2 links]
    GATE - FREE IE app w`source code [shef.ac.uk]
    LaSIE - ER, coreference, template (cv) [shef.ac.uk]

    pg. 617 Univ of Surrey (inconclusive matches)[2 links]
    System Quirk - Or is this their search system..... Hmmmmmm [surrey.ac.uk]
    Univ of Surrey - pointers (hopefully this is their WILDER search system...) [surrey.ac.uk]

    SMU - Pg. 65[1 link]
    Natural Language Processing Laboratory at SMU [smu.edu]

    Textract[2 links]
    Cymfony - Technology [cymfony.com]
    Textract - State of the Art Information Extraction [textract.com]

    Xerox uhhhhh maybe[1 link]
    Xerox Palo Alto Research Center [xerox.com]
    (OVERVIEW) 1999 TREC-8 Q&A Track Home Page [att.com]
    NLP bloke, Univ Sussex [susx.ac.uk]


    Tcl-Tk[4 links]
    Tcl tutorial [scriptics.com]
    Tcl-Tk Contributed Programs Index [sco.com]
    Tcl-Tk Resources, sources [scriptmeridian.org]
    TclXML - manipulating XML using Tcl-Tk [zveno.com]
    Artificial Natural Language - Is this what I'm trying to parse into... [stanford.edu]
    Comparison of Indexers - Prise vs. Inquery vs. MG, etc. [umd.edu]
    Eagles - Language Engineering Standards [pi.cnr.it]
    Language Technology Group - lots of modules! [ed.ac.uk]
    LDC - Linguistic Data Consortium, lots of corpora [upenn.edu]
    Lexical Resources [tokushima-u.ac.jp]
    Links 2 resources, indexers..... [umd.edu]
    Lots of IR stuff, University of uhhh [upenn.edu]
    Managing Gigabytes Indexer [mu.oz.au]
    Managing Gigabytes Manuals and stuff [rmit.edu.au]
    Htdig search system [htdig.org]
    NLP & IR (NLPIR, NIST) Group [nist.gov]
    OVERVIEW OF MUC-7-MET-2 [saic.com]
    Perl XML Indexing - XML search engine type thing [xml.com]
    Phrasys Language Processing Software Components (money) [phrasys.com]
    QA HCI bullshit [fxpal.com]
    SIGIR - TREC-type thing, resources [acm.org]
    SMART indexer system documentation [pi0959.kub.nl]
    Text REtrieval Conference (TREC) Home Page [nist.gov]
    The Natural Language Software Registry [www.dfki.de]
    Thunderstone IE and IR products [thunderstone.com]
    WordNet - FREE DOWNLOADABLE lexical English database [princeton.edu]

    Page created with URL+ [urlplus.chat.ru], nice utility for working with internet shortcuts
  • I have worked with DiscoveryLink; it wraps heterogeneous data sources - Oracle databases, flat text files - and tries to integrate everything into a single representation.

    In the life sciences, data sources are huge and plentiful. This thing is a monster: it's slow, and it needs lots of dedicated people integrating and maintaining it. I'm not even talking about the (IBM) hardware you need for this.

    No, I'm a pragmatic guy. I will integrate on the fly whatever I need to know. The idea is nice and all, but it is unworkable at the moment.
