



Mining Unstructured Data 105
jscribner writes "Data these days tends to take an unstructured form, be it text (like the web, email, or books), spoken word, or even DBs with unique organization (and thus a discrete language). There's a new article on Unstructured Data in Think Research; it's an overview of the challenges, progress, and potential rewards in this area. I'm leaving it on your doorstep because, to me, it's a good launching point for discussion of several interesting possibilities: /. as a minable DB of ideas, email identified by interpretation rather than keywords, emotive XML, etc."
They've discovered Google! (Score:2, Insightful)
Re:They've discovered Google! (Score:1)
answer is perl (Score:1, Interesting)
http://www.google.com/search?q=learn+perl+for+h
(remove the silly space in "humanities")
perl and a lot of thinking, that is.
I've been working on a project... (Score:1, Interesting)
Of course, this isn't limited to web-based e-mail; there are parsers for web-based forums and bulletin boards, yes, even for Slashdot. The unstructured data here can be converted and served via NNTP (NetNews) or some other method.
There's a huge amount of unstructured data and services available on the Web; making these available to computers would be a huge step forward in information technology.
Oh Man (Score:1, Insightful)
For instance, Company A is tracking all their data in a Microsoft Word document. I frequently get asked to work with this data dynamically and pull it out programmatically. I can attest to how difficult that can be, and I frequently find that upper management doesn't understand the challenges behind pulling unstructured data out.
I definitely recommend this article - especially when trying to explain to your boss why you can't flick your magic wand, and *poof* the data moves from his text file into a database.
Re:Oh Man (Score:1)
Re:Oh Man (Score:2)
And pray your boss hasn't heard of Perl :)
/. as a Turing Test (Score:5, Funny)
A Machine will be considered truly intelligent when it can translate all emails on slashdot into a usable form. Since spammers are some of the most persistent and aggressive users and developers of technology, I expect we'll have real AI telling us how to enlarge our penises by next Thursday.
Re:/. as a Turing Test (Score:1)
"Slashdot as a minable database of ideas..." (Score:4, Funny)
Oooookay.
Sir? Please step away from the bong.
I just spent an enjoyable half hour or so reading Business 2.0's "minable database" of 101 Dumbest Moments in Business, and then I had a look at their even-more-hilarious 100 Dumbest Moments in e-Business [business2.com]. This article really does have that weird flavor of megalomaniacal Internet-hype gibberish that we all came to know so well during the boom years. In a way, it's a pleasant little nostalgia trip to see the same old idiocy presented with the same old mindless confidence, but in another way it's just depressing.
Reality Check: Slashdot is a BBS for bored IT workers taking a break while installing nine hundred copies of Word on nine hundred 266 MHz beige boxes at the local credit union. It is not a minable database of ideas (or at least not of ideas worth mining). At its best, it's an undergraduate bull session.
What the hell are you people smoking?
Re:"Slashdot as a minable database of ideas..." (Score:1)
You bastard! Can you imagine what it's going to do to the miner's self-esteem, when it reads your message?
Re:"Slashdot as a minable database of ideas..." (Score:2)
Re:"Slashdot as a minable database of ideas..." (Score:2)
_Sometimes_ unique or semi-unique or thought-provoking ideas get stated. That's the nature of chat rooms and discussion boards. USENET has unique ideas as well, and is much more spam-filled and useless-looking at first glance.
Re:"Slashdot as a minable database of ideas..." (Score:1)
I would have to HIGHLY disagree... (Score:2)
There are a TON of "Ask Slashdot" articles with very valuable information. I also see many valuable posts in this article (especially the one with the tons of links on machine learning and mining). I remember some valid ideas bandied about back on the homebrew rollercoaster posting, and I saw information yesterday in a posting about Mozilla explaining how to get nice-looking fonts in X. Finally, I remember quite some time back (possibly up to 2 years ago) an article on AI in which two individuals, who at minimum seemed to know their shit and at best were both neuroscientists, argued about how neurons work and how the brain "thinks" - most of that went WAAAY over my head, but it was valuable information (or at least it might be a stepping stone).
I see this all the time here on Slashdot (Score:4, Funny)
Interesting, my mining of hot ideas on Slashdot has determined that a Beowulf Cluster of First Posts is the next big thing...
Doomed Doomed we're all doomed (Score:3, Insightful)
Simply put, if you don't understand what someone is talking about, you can make a reasonable guess and then refine it but you are always making assumptions.
To put it in simple terms for George W. Bush:
All Muslims are Terrorists
All Supporters of Militia (McVeigh) are terrorists
Unstructured data is a great way to make money, and a great way to get 80% of the story; the trouble is the other 20% gets destroyed in the process.
Welcome to 1984, and a Brave New World, the minority will cease to count.
Re:Doomed Doomed we're all doomed (Score:2)
Probably it should be randomly structured data, but in any case the problem still boils down to what you described: trying to decide what is relevant and how. Otherwise you just have a bunch of blobs.
Or else you have a database with links to the random objects (Word docs, etc.), with descriptions etc. in the database about the objects. Quick and dirty, but not the best solution.
database of descriptions (Score:2)
which, come to think of it, is what is happening in XML anyhow: you are adding tags in the file instead of having descriptive data outside in the database.
YMMV as far as which method will work better for you.
Re: (Score:2, Interesting)
Re:Semi structured data, likely the way. (Score:1)
However, once you have some structures (say basic linguistic units like sentences, words, paragraphs, pages, speakers, etc.) you can then create other ones. From those structures you can then use various techniques to develop more information.
Once again: great in theory, complex in practice. However, many of the techniques used in NLP to understand words can then be expanded to larger units of meaning. Further, you can then start to relate various types of contexts. Of course, how helpful all this is depends on the type of analysis you are making. Some practical problems are very solvable now. Other problems are more complex.
But consider some future "Google" which indexes pages based not on words but on concept spaces. It then uses other methods, such as the links to a page and so forth, to rank not just pages but concept *spaces* within a page. Finding information would be much, much more helpful.
Re:Doomed Doomed we're all doomed (Score:1)
Nothing is destroyed. The original data is still there. These technologies are best used to summarize and search large information archives (e.g. the web). Does Google destroy any data? No, it merely indexes it in a certain way. In fact, search services often make data more easily accessible, the opposite of what you are arguing.
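That index-not-destroy point is easy to see in miniature. A toy inverted index (documents and names invented for illustration) references the originals without ever touching them:

```python
# Toy inverted index: maps each word to the set of documents containing it.
# The source documents are never modified -- only referenced.
def build_index(docs):
    index = {}
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index.setdefault(word, set()).add(doc_id)
    return index

docs = {
    "a": "unstructured data is everywhere",
    "b": "google indexes data without destroying it",
}
index = build_index(docs)
print(sorted(index["data"]))  # -> ['a', 'b']
```

The documents in `docs` are byte-for-byte what they were before indexing; the index is pure addition.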
Re:Doomed Doomed we're all doomed (Score:1)
Max
Re:Doomed Doomed we're all doomed (Score:1)
Good use of XML (Score:2, Informative)
From the article:
One tool used to corral unstructured data is XML (extensible markup language), which tags salient parts of unstructured electronic documents so they can be searched. The structure of XML documents resembles that of a tree, with branches of tagged information, while relational databases consist of regimented rows. "Being able to produce, accept, store, and search XML provides a little structure to unstructured information," explains Selinger of the Silicon Valley Lab.
This makes a lot of sense. When you think about it, things like images and audio clips can provide some very useful information, but they can be difficult to classify and store in a useful and searchable manner. Having a product or suite of products that could not only classify, but also search the many different types of XML signatures for each type of resource could prove very valuable for businesses.
Imagine the amount of time that could be saved if you could simply search all of those images/diagrams that you have for different projects, and all of the audio clips from those conferences you have attended, for that key idea you're sure is in there but just can't remember where!
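To sketch what the quoted passage describes (the tag names here are made up, not from any real product), tagging the salient parts lets you search by field instead of grepping a raw blob; Python's standard ElementTree is enough to show the idea:

```python
import xml.etree.ElementTree as ET

# A hypothetical tagged document: the tags provide "a little structure".
doc = """
<memo>
  <author>Selinger</author>
  <topic>unstructured data</topic>
  <body>XML tags the salient parts so they can be searched.</body>
</memo>
"""

root = ET.fromstring(doc)
# Query by field rather than scanning the whole text blob.
if root.findtext("topic") == "unstructured data":
    print(root.findtext("author"))  # -> Selinger
```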
-ryan
Re:Good use of XML (Score:2)
Seems to me like we're coming full circle with OOP and XML - trying to create huge monolithic structures that can handle everything we need to do. Look at Java - everything ultimately is inherited from the almighty Object. XML is no better in this regard, although you can have lots of different XML files describing different pieces of data.
I wonder if the people pushing hierarchical (OOP and XML) data models over relational ones realize that the exact opposite was the case 30 years ago. Perhaps we should have stuck with hierarchical databases in the first place?
Re:Good use of XML (Score:2, Insightful)
As for the Object hierarchy in Java, it really doesn't limit what you can do with the objects and classes... you can still have a class with no data and only static methods, which is just like a function in C. The nice thing about the automatic Object superclass is that it makes generic, heterogeneous containers really easy to use.
Re:Good use of XML (Score:2)
Don't mistake a hierarchical type structure for a hierarchical data structure.
In Java, one might model things so that Persons and Vehicles are both subclasses of Object, and Cars and Trucks are subclasses of Vehicle. This is indeed strictly hierarchical.
But a Person called Joe can be the owner of a Truck, ride in a Car, and be the spouse of another Person, Jane, simultaneously. That's not a hierarchical relationship; it's a web of connections.
You can still have hierarchical relationships with OO data; if Joe sells his truck, the Engine and the four Wheels would automatically go along with. But that's just one possible relationship.
Re:Good use of XML (Score:1)
That doesn't sound like a big problem, but it can be when you are using regions to map out new concepts (i.e. analyze a class of words in all sentences that contain the concept of Apple computers). In practice, writing "concepts" to analyze (data mine) texts of this sort is very hard. Further, using tools like Perl can be a pain. Yeah, you can do it, but you probably won't do it well.
I know that the company Sageware [sageware.com] which I have dealt with does what this article describes. However it supplies various "objects" for mining for concepts. It ends up being tricky stuff which is why mainly large portals use the technology.
The basic notions can apply to Perl or simple C code. Go very complex though and things get messy very quickly.
UK data protection law (Score:2, Interesting)
UK data protection law states that copies of email have to be kept for a 28-day minimum period. It advises that "email is a transitory medium", and our company's person in charge of such policies has just written a policy saying I'm supposed to program our mail systems to auto-delete mail after a three-month period. Staff are supposed to save the emails they want to keep to their local hard drives, as they suddenly become "documents" rather than emails.
Why? Because in the UK any individual can legally ask for copies of any email that mentions them individually by name. Local hard drives can be searched, but only if the documents are stored in a "structured filing system". I have raised concerns about what constitutes a "structured filing system", to the point where I would argue that FAT, NTFS and HFS are structured, due to the fact that they utilise indexes. Add to this the new MS Object-Oriented File System (OFS), which is basically going to be a simplified version of SQL Server as a filing system: is the ability to search previously "unstructured" data going to complicate the UK law?
Forward this to the Director of IT, stat! (Score:4, Funny)
Mining /. (Score:1)
Hell, I've really got to stop reading at -1 so much.
--saint
Google Made to Order (Score:3, Informative)
Re:Google Made to Order (Score:2)
Of course if you were in IBM Research, as the authors are, you might have been familiar with The Clever Project [ibm.com] prior to Google. It is explained very nicely here [sciam.com].
I am not saying that the authors might not have been inspired by Google, but I am saying that Google isn't the only possible source of their inspiration.
Re:Google Made to Order (Score:2)
But wait, it was a press release. Submitted by someone from IBM, too.
Re:Google Made to Order (Score:2)
Just to make the conspiracy complete, I am from IBM as well.
Re:Google Made to Order (Score:1)
good title, but mismatched content (Score:3, Interesting)
I just attended the Knowledge Technologies [knowledget...logies.net] conference in Seattle. It's scary how many people think the way to mine unstructured data is to force it into a structure. So many people spending years developing standard taxonomies--different standards, of course. And so many companies (like Semio [semio.com], for example) that want you to develop your own taxonomy. Then you wind up with the very problem this article really discusses.
[Skip next section to avoid my self-promotion]
I'm a big fan of mining unstructured (and differently structured) data by throwing a mining layer on top of it. All of us at Think Tank 23 [thinktank23.com] are. Check out the demo of our technology, Waypoint 2.0 [thinktank23.com], which pulls concepts from unstructured documents, then uses the concepts as the basis for finding relationships between them.
This Is Like Mining Money (Score:5, Funny)
Report: The attached email messages indicate a successful business plan. This simple way to make money fast by selling pamphlets is interpreted as being good: it has been confirmed by many quotes within the email, by repetition in many similar emails, by the suggested calculation of potential return.
Opportunity: There is an unfilled business opportunity which is confirmed by the lack of existing businesses which use this plan. Searches of local and national databases have not found any businesses which are using this method.
Suggestion: Give me a dollar so I can start a business.
Re:Email? (Score:1)
That would make my new e-mail address you-f**king-b*****d-dont-you-come-around-here-again-you-c**k-s****r-exclamationpoint-exclamationpoint.
Somehow I just can't see telling Mom my new e-mail address...
Some interesting technology... (Score:2)
Worked at two starups that do this (Score:2, Insightful)
like tools to tease structured data out of semi-structured web pages and other listings. It's doable if you limit your scope to one particular subject, such as job listings. The hardest part is creating contextual lexicons. Does MS mean that a master's degree is required? The job is located in Mississippi? Experience with Microsoft products is required? The HR contact is Ms. Smith? You have to figure it out based on context. Is MS preceded by a city name, that type of thing.
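The parent's contextual-lexicon problem can be sketched with a few hypothetical rules; a real system would need far more than this, but it shows the shape of the thing:

```python
import re

# Hypothetical disambiguation rules for "MS" in a job listing, checked
# in order; the surrounding context decides the reading.
def disambiguate_ms(text):
    if re.search(r"\bMs\.\s+[A-Z]", text):              # "Ms. Smith" -> honorific
        return "honorific"
    if re.search(r"(Jackson|Biloxi),?\s+MS", text):     # city name before MS -> state
        return "state"
    if re.search(r"MS\s+(Word|Office|Windows)", text):  # product name after MS -> Microsoft
        return "company"
    if re.search(r"MS\s+(degree|required|in\s)", text): # "MS degree" -> master's
        return "degree"
    return "unknown"

print(disambiguate_ms("MS degree required"))  # -> degree
print(disambiguate_ms("Jackson, MS 39201"))   # -> state
```

The city names and product names here are invented stand-ins for the lexicons the parent describes; building those lists out is exactly the hard part.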
Nat. Language Understanding != Speech Recognition (Score:3, Informative)
Polymorphic Searching (Score:2, Informative)
Think of the informal knowledge embodied in the emails sent and received, attachments, spreadsheets, favorite websites, your colleagues' documents, as well as SQL databases and the like. There simply is no suitably shaped container that you can put amorphous knowledge into. It defies structure, and XML is no answer.
Useful knowledge is of a pervasive nature. It infuses through everything, and often the really useful bits are where you least expect them, so attempting to design a structure, a priori, to hold it is always doomed to failure.
The key here is polymorphic searching of both structured and unstructured data without distinction. That's where products such as ISYS [isysusa.com] earn their salt. The hard part is in convincing the blissfully unaware that knowledge is being wasted in the first place.
The other key concept is value. Large result lists are less useful than small, high-quality result lists. Everybody knows this from using Google and getting back 198,000 hits. In the old CB radio days, it was called a squelch knob. Search engines that just give you large amounts of static do you a disservice. Useful results are small and targeted.
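The squelch knob is really just a score cutoff. A minimal sketch (the relevance scores are invented):

```python
# "Squelch" for search results: drop anything below a relevance cutoff
# so the user sees a small, targeted list instead of static.
def squelch(results, threshold=0.5):
    return [(doc, score) for doc, score in results if score >= threshold]

results = [("doc1", 0.91), ("doc2", 0.12), ("doc3", 0.77), ("doc4", 0.05)]
print(squelch(results))  # -> [('doc1', 0.91), ('doc3', 0.77)]
```

Turning the knob is raising `threshold` until only the strong signals remain.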
Re:Polymorphic Searching (Score:2)
It is just not structured according to how the COMPUTER sees it.
Hell, this posting of mine right here is structured, beyond the obvious sentences/paragraphs explanation that is most often given.
Almost all written work is designed so as to allow the reader to follow along with the author's thought process.
Indeed, writing could be looked at as some sort of bare-level, one-shot emulation code for the human brain.
Now for computers this makes NO sense at all.
Uh duh, they don't think.
Now, with a lot of work, natural languages can indeed be PARTIALLY understood by computers, and there is an artificial language out there (I forget the name) that was designed from the ground up for both computer and human understanding on a quasi-equal level. But even so it cannot match the same. . . . underlying meanings between both parties.
Humans are capable of understanding all of the complexities of modern-day computers; it may require a lot of work and some darn good wizardry, but it IS possible.
The issue is that the way computers 'think' is but a subset of our own thought methods that we have expanded upon and made more complex, but ultimately added nothing new to.
And yet it is by the very nature of being a subset that computer 'thinking' (ugh I hate using that term in this context) can only contain a partial set of the abilities of Human thinking.
Ah, to take a related explanation from Dansdata [dansdata.com]
"But a clever enough algorithmic composition system can get around this, by using a human to direct it through infinite musical space. With any luck, the human will have some idea of what sounds good; that's a really difficult thing to teach a computer."
(speaking about the Korg Karma's composition functions)
Humans have to GUIDE the computer.
For instance the file finder feature on many OSs.
If I tell my Windows box to search for mIRC* it will search my entire computer's hard drive including my Cygwin folder and my C:\corel folder.
Which is obviously highly friggin stupid, since mIRC is NOT going to be in either one of those. (Well, not today at least.)
But the COMPUTER does not know that. Despite my having a highly refined layout system for my files, with everything compacted into nice small little subsets of subsets by file type, the damn computer has:
No idea WTF mIRC is, what IRC is (outside of some sort of program that tells the computer to interpret network packet X with evaluation system Y and display Z depending on X's contents, and oh yah, shove the word IRC on the window while you're at it. That is ALL computers know of IRC), what the hell a 'program file' is, or why in the world (no concept of 'why' either) mIRC would be in C:\program files\
Now if I use a bit of human judgement and direct the computer to search only C:\program files\ it can find the requested files just fine.
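That "guide the computer" step is just narrowing the search space by hand. In script form (the paths are hypothetical):

```python
import os

# Human-guided search: the person supplies the scope the machine lacks
# the judgment to pick. Walking one subtree beats scanning every drive.
def find_files(root, prefix):
    matches = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.lower().startswith(prefix.lower()):
                matches.append(os.path.join(dirpath, name))
    return matches

# The human knows mIRC lives under Program Files; the computer does not.
hits = find_files(r"C:\Program Files", "mirc")
```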
But it is STUPID. Period.
What is the BEST possible outcome we can hope for in this situation? Hmm?
Hah. All files in some sort of a database system? Make it 'object based'? Or just add assloads of data to the 'file fork'.
Bah, it would STILL come down to the computer going over each friggin entry in a database until it gets a match with the search string. Hell, even if some more efficient searching algorithm is used besides just going through every item in the database, the fact is that the computer
(pay attention here folks)
STILL HAS NO FRIGGIN IDEA AS TO WHAT IN THE HELL mIRC is.
I can add descriptors to heck to all files associated with the program. And the computer will STILL NOT KNOW WHAT mIRC IS!
Once again.
THE COMPUTER HAS NO IDEA AS TO WHAT THE HELL ANYTHING IS.
For instance.
I know offhand that my copy of VirtualDub is in F:\video editing tools\virtual dub\ (actually the version number follows it, but close enough.)
Now the computer has no idea what 'video editing tools' is. (I am using "is" here, folks. Plural? Huh, what's that? What is 'what'? The computer does not have an understanding of ANY of these topics.)
In fact, one thing that SO many people seem to forget, is that COMPUTERS UNDERSTAND NOTHING.
Nothing AT ALL.
PERIOD.
So please.
Please.
PLEASE
Understand that the computer will NEVER be able to truly organize or structure your data, because the computer does not even know what the hell a structure is. Sure you can tell it to shove such and such bits into such and such places, but it knows not what those bits are or what those bits mean or what those places mean or what the hell a place is or ANYTHING ELSE AT ALL.
I can make my computer feel happy.
I have it show "I am happy" on the screen.
That is as close as you are ever going to get the current breeds of computers to being able to understand or think about anything at all.
Because everything eventually comes down to that same basic fact.
The computer does what you tell it to and nothing else.
XML as a solution (Score:1)
Isn't XML only part of the solution? I'm pretty sure RDF [w3.org] comes into play somewhere here.
XML won't make it (Score:4, Insightful)
- show a text and find other texts about the same subject.
- hum a tune and find an mp3 of the same music.
- show a picture and find other pictures of the same girl.
- better, show a picture of a girl's face and tell your search engine to find nude pictures of the same girl...
Until those simple tasks can be done easily, we will be stuck with the 13500 links one gets when searching for "christina ricci nude" in Google.
Re:XML won't make it (Score:2)
VideoLogger has neato features like speech-to-text, speaker identification, face recognition, and keyframe extraction. All of those things happen in real time, if the PC is fast enough for it.
Combined with a half-decent RDBMS back-end, you can do stuff like search on "Saddam Hussein" and get back a reference to a clip that includes a picture of him, but not his actual name anywhere in the voiceover or the CC data. It's pretty cool.
It's also, like, $60,000 a copy, or something.
No, I don't work for Virage, and I've never had a business relationship with them. I've seen their stuff demoed, though.
Wow! But don't you have to train this system? (Score:1)
Re:Wow! But don't you have to train this system? (Score:2)
Then they ran some news footage through the system that had other pictures of Hussein in it. VideoLogger picked him out and assigned the keyword "Saddam Hussein" to the clip. It did this based on face recognition, not speech or CC recognition, because the video clip was from the Russian TV news!
It was pretty cool, even though it was just a demo.
creative uses (Score:3, Informative)
It makes you wonder how much of this is based on theoretical linguistics [stanford.edu] and formal semantics [mit.edu], and how much is based on good old fashioned statistics [nec.com] and optimization.
Re:creative uses (Score:2)
I can't speak to the work discussed in the original post, but I do know that in the real world a formal linguistics/semantics approach is impractical. These systems require complete or near-complete knowledge structures to work at all. They are brittle, meaning that as the world changes they fail to adapt to the changing lexicon. Formal systems are often computationally expensive and scale poorly to large data sets. The practical problems of constructing and maintaining the formal knowledge structures quickly overwhelm the advantages they have over looser approaches.
So in most cases it is a hybrid of machine learning and statistical techniques that is used in these systems.
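Those loose statistical techniques can be surprisingly serviceable; here's a bare-bones term-overlap categorizer of the sort such hybrids build on (the categories and training text are invented):

```python
from collections import Counter

# Bare-bones statistical categorizer: score a document against each
# category by counting shared terms. No formal semantics anywhere.
def train(labeled_docs):
    model = {}
    for label, text in labeled_docs:
        model.setdefault(label, Counter()).update(text.lower().split())
    return model

def classify(model, text):
    words = text.lower().split()
    return max(model, key=lambda label: sum(model[label][w] for w in words))

model = train([
    ("sports", "game team score win season"),
    ("finance", "stock market price shares earnings"),
])
print(classify(model, "the team hopes to win"))  # -> sports
```

Real systems weight terms and smooth counts, but the looseness is the point: new vocabulary just becomes new counts, with no knowledge structure to rebuild.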
Re:creative uses (Score:1)
They are talking about searching (Score:2)
There is a package that is good at displaying unstructured data and letting you see strange patterns, since it has tools to find patterns in the data. It's called Partek [partek.com].
E2? (Score:1, Funny)
This can be handled (Score:3, Insightful)
This can be handled, and is handled, by metadata. Most OSes do a limp-wristed version of it every day--"that movie I downloaded a few days ago..."
Natural language grepping through a binary audio file is, no doubt, quite cool, but I believe mostly wasted effort. Well, wasted effort for everybody except IBM, who might sell a few more seats of ViaVoice. I say it's wasted because, most often, it's not the content itself you remember but the circumstances surrounding it. "I saw an article in a magazine, and I read it on the train on the way to Boston--it had something to do with widgets." No amount of data mining will appropriately pull that info out of a simple text file.
I relate all this in terms of human-interaction, i.e. the computer mining to satisfy the needs of a carbon-based lifeform who regularly purchases Big Macs. Data-mining between computer programs for other computer programs would be a different kettle of fish altogether--and there are a lot of ex-LISP hackers at MIT who would like to know how you got something like that to work, thank you very much.
Oh, and Apple called--they'd like their Knowledge Navigator [dcltd.com] back, please.
KDD (Score:1)
FlipDog uses this for job-hunting (Score:1)
FlipDog [flipdog.com] crawls the web and uses machine-learning technology to extract job listings from companies' web sites. You can then browse through them at their web site, filtering based on location, position category, etc.
The technology was developed by WhizBang Labs [whizbang.com], and is quite cool. They basically take a small set of job listings that their crawler finds, have a human classify parts of the web page (job title, location, description, etc.) and then let their software program loose on it. It analyzes the human-filtered web pages, "learns" how to extract relevant data, and then uses that to classify all of the other crawled pages. Of course, this is over-simplified, but that's the basic idea.
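In miniature, that label-then-generalize loop looks something like this; the "learning" here just memorizes the text around a human-labeled field, a toy stand-in for whatever WhizBang actually does:

```python
import re

# Toy wrapper induction: from one human-labeled example, learn the text
# immediately before and after a field, then apply it to unseen pages.
def learn_pattern(page, labeled_value):
    i = page.index(labeled_value)
    before = page[max(0, i - 10):i]
    after = page[i + len(labeled_value):i + len(labeled_value) + 10]
    return re.compile(re.escape(before) + r"(.*?)" + re.escape(after))

def extract(pattern, page):
    m = pattern.search(page)
    return m.group(1) if m else None

train_page = "Job title: Sysadmin | Location: Boston"
pattern = learn_pattern(train_page, "Sysadmin")
print(extract(pattern, "Job title: DBA | Location: Omaha"))  # -> DBA
```

One labeled page generalizes to every page with the same layout, which is why the approach works well on templated job listings in particular.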
(I'm not affiliated with them, other than successfully finding a job there.)
--Bruce
Re:FlipDog uses this for job-hunting (Score:1)
But for dynamic environments (email, Usenet, news/weblog RSS feeds, knowledge bases, etc.), the WhizBang approach, and just about all approaches that rely on sample-based training or hand-built taxonomies, just doesn't scale.
But at least you found a job
Re:FlipDog uses this for job-hunting (Score:1)
Concepts and so forth are far more unstructured. Consider the problem of finding all references to Apple executives. Now you can get part way there with complex queries. But somehow you have to take some information (say executive names gleaned from connection to terms about executives near terms related to the company name) and then use that info to define spaces in a text or information in text to get you further information. That is a much more complex problem than simply tagging text with XML or so forth. The final output might possibly be taggable. However generating that final output involves many intermediate steps that require complex views of both terms and space.
You end up requiring a way of querying documents so that you can use complex boolean and ranked queries and complex notions about position and space ranges. Thus you might have a complex boolean query that finds all terms with a certain rank (to do fuzzy match or more complex notions of belonging to a set). Then with those results you create a region and then use those regions for further calculations.
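Lextek's actual engine is proprietary, so this is only a crude sketch of the region idea: find where a term from one set falls within a window of a term from another, and hand back that span for further calculation (all names invented):

```python
# Crude proximity query: find positions where a term from set A occurs
# within `window` tokens of a term from set B, and return that region
# as a (start, end) token range for further calculations.
def find_regions(tokens, set_a, set_b, window=5):
    regions = []
    for i, tok in enumerate(tokens):
        if tok in set_a:
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            if any(tokens[j] in set_b for j in range(lo, hi)):
                regions.append((lo, hi))
    return regions

tokens = "apple executive steve jobs announced new products today".split()
print(find_regions(tokens, {"executive"}, {"apple"}, window=3))  # -> [(0, 5)]
```

The returned ranges are the "regions" the parent describes; a real engine would layer ranked and boolean operators on top of them.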
My caveat for all this is that I did work on a project for Lextek International (Lextek.com) that did do all this. So I'm somewhat biased. Probably no one here (given the Open Source nature of things here) would likely be a client. So hopefully I can say all this without anyone thinking I'm just tooting my own horn. Besides - I hardly ever see anything on slashdot I can actually say anything about.
Knowledge Discovery in Databases (Score:1)
For an introduction you should read _Introduction to Machine Learning_ (Kodratoff, Yves; Morgan Kaufmann; 1988), or for a more recent and complete survey, _Advances in Knowledge Discovery and Data Mining_ (Fayyad, Usama; AAAI/MIT Press; 1996).
This was my final year project thesis (Score:2, Informative)
This was my final year project thesis [f9.co.uk]. Just remember the golden rule: unstructured 2 structured == convert 2 XML. I wrote a [very bad] program in C++/Perl/tcsh (IPC = pipes) to add XML tags to English, and then index the result into a search engine which would use the lingual data stored in the XML tags to help the search.
NIST does a MASSIVE [nist.gov] competition on this annually. I don't want to be an XML-buzzword whore <Arnold Schwarzenegger accent> (XML commando eats Green berets, C++, Java, Perl, COBOL for breakfast)</Arnold Schwarzenegger accent> but you can't beat XML for easily converting anything that you can make sense out of into computer readable format. Real h3cKoRs use SGML, but us underlings have to stick with things we can understand like XML. As for expandability, if we want to encode something else into the document, then just tag-it-and-go
;-)
It took me 200 hours to fish out all these links (before the Google days); I don't want anyone to have to waste as much time as I did feeding the search engines exotic foods. It's a year old, so pardon the odd broken link; armed with these you could probably turn jello into XML.
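A stripped-down version of that tag-then-index pipeline (the tag set here is invented and nothing like the real thesis code) might look like:

```python
import xml.etree.ElementTree as ET

# Tag each word of an English sentence with a crude lingual category,
# emit XML, and show the tagged data is now machine-queryable.
DETERMINERS = {"the", "a", "an"}

def tag_sentence(sentence):
    root = ET.Element("sentence")
    for word in sentence.split():
        kind = "det" if word.lower() in DETERMINERS else "word"
        el = ET.SubElement(root, kind)
        el.text = word
    return root

root = tag_sentence("the cat sat on a mat")
# A search engine can now query by tag instead of raw string matching.
print([el.text for el in root.findall("det")])  # -> ['the', 'a']
```

Swap the one-line determiner check for a real tagger and the "tag-it-and-go" expandability is exactly as described: new lingual data is just another element name.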
Re:This was my final year project thesis (Score:1)
Max
Re:This was my final year project thesis (Score:1)
Yeah great, knock yourself out.
Discovery Link (Score:1)
In life sciences, data sources are huge and plentiful. This thing is a monster; it's slow, and it needs lots of dedicated people integrating and maintaining it. I'm not even talking about the (IBM) hardware you need for this.
No, I'm a pragmatic guy. I will integrate on the fly whatever I need to know. The idea is nice and all, but it is unworkable at the moment.