Text-Mining Your E-mail
Misha writes "There have been a number of weeks/months in anyone's life that called for better organization of your Inbox. Filtering and folders work, but it'd be nice to have a text-mining tool running in the background that categorizes incoming messages by topic as they arrive. It's nice to see that besides NLP research, there are some great algorithmic advances being made, as seen in this paper. Perhaps one of them Perl monkeys will even quickly hack up such a background tool." Note: it's a PostScript file.
PS-PDF Document format conversion (Score:5, Informative)
Yet another reason for.. (Score:4, Informative)
It automagically does full text indexing of all specified databases. To it, your Inbox is just another database.
Re:The ultimate spam blocker? (Score:2, Informative)
SpamAssassin [taint.org] will do this part for you.
Remembrance Agent (Score:5, Informative)
It's more general than e-mail, but in the wearable computing community [blu.org], there's a little application called Remembrance Agent [mit.edu], written by Bradley Rhodes [bradleyrhodes.com] that many folks use. In terms of stand-alone UI, it's still quite primitive, but that's because it was built around dynamic hooks into Emacs.
I've been playing around with some Java-based wrapper code, to wrap the ra-retrieve executable in a Server and allow clients to access the data via sockets. I have a Java-based client coded up that hooks into the System clipboard, but it's still in alpha-mode. All GPL'd of course, but needs a little time to mature. It's a proof-of-concept, work in progress. :-)
Check out Brad's site for more insight into the work he did and is doing.
procmail! [Re:The ultimate spam blocker?] (Score:5, Informative)
First, I sort out mail from the mailing lists I read.
Then, mail from friends and people I correspond with a lot.
Finally, I have a weighted scoring recipe:
# H and B: search both the headers and the body (the From/To conditions need the headers).
# h: deliver only the header of a matching mail to the folder.
:0 HBh
* -199^0
# Start the score at -199. The mail is filed into mail/Trash only if the total is above 0 once all conditions have been scored.
* 50^1 ^(From|To):.*@hotmail.com
* 50^1 ^(From|To):.*@yahoo.com
* 50^1 ^(From|To):.*@aol.com
* 50^1 ^(From|To):.*@msn.com
* 50^1 ^(From|To):.*@excite.com
* 50^1 ^(From|To):.*@netscape.net
* 50^1 ^(From|To):.*@yahoo.co.uk
# Most mail to and from these domains is spam, so score it.
* 100^1 opt-out
* 50^1 opt-in
* 200^1 OTCBB
* 50^1 viagra
* 50^1 zyban
* 50^1 propecia
* 75^1 FREE
* 75^1 GUARANTEED
* 75^1 LEGAL
* 50^2 MILLIONAIRE
* 50^1 100%
# Words I only see in spam.
mail/Trash
This works quite well for me. If any spam gets through, I look for words in it that I don't get in normal mail and add them to the scoring.
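For anyone who wants to play with the scoring idea outside procmail, here is a rough Python sketch of how a weighted recipe like this adds up. It is only a simplified model of procmail's `w^x` scoring as I understand it (first match adds w, the next adds w*x, and so on), not a replacement for it. The names `RULES`, `START_SCORE`, `spam_score` and `is_spam` are made up for illustration, and only a handful of the rules are copied over.

```python
import re

# Hypothetical subset of the recipe above: (weight, factor, regex).
# Assumed model of procmail's "w^x": the k-th match adds w * x**(k-1).
START_SCORE = -199
RULES = [
    (50, 1, r"^(From|To):.*@hotmail\.com"),
    (100, 1, r"opt-out"),
    (200, 1, r"OTCBB"),
    (50, 1, r"viagra"),
    (75, 1, r"GUARANTEED"),
    (50, 2, r"MILLIONAIRE"),
]


def spam_score(message, rules=RULES, start=START_SCORE):
    """Return the total score for a raw message (headers plus body)."""
    score = start
    for weight, factor, pattern in rules:
        # procmail matches case-insensitively by default, so do the same here.
        flags = re.IGNORECASE | re.MULTILINE
        hits = sum(1 for _ in re.finditer(pattern, message, flags))
        score += sum(weight * factor ** k for k in range(hits))
    return score


def is_spam(message):
    """Mirror the recipe: the mail is trashed only if the score ends above 0."""
    return spam_score(message) > 0
```

Starting at -199 means a single 50-point word is never enough on its own; it takes one 200-point hit or a combination of several weaker ones to push a mail over the line.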
Re:What I want (Score:2, Informative)
For your second point:
you could switch to the Maildir format instead of the typical single-file 'mbox' format. Maildir is used by the qmail MTA as well as courier-imap. I run all my email servers in this manner and I've noticed significant speed improvements in mailboxes that hold many messages.
Maildir maintains three directories, of which two are significant: cur and new. Any new message delivered into the Maildir mailbox is placed in the "new" directory; once it's been read, it's moved into the "cur" directory. Each message is its own file, so there is no speed penalty for appending messages to mailboxes that already hold many messages. Of course, all these directories are transparent to the end user. Maildir-capable MUAs (for console users) and Maildir-capable IMAP/POP systems are freely available: qmail does SMTP plus sendmail wrapping and includes a basic POP3 daemon, courier-imap does IMAPv4 amongst other things, and all of these apps lend themselves to being run over SSL via stunnel.
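If you want to see the new/cur shuffle for yourself, Python's standard `mailbox` module can create and manipulate a Maildir. This is just a small sketch under my own assumptions: the function name `deliver_and_read` and the path you pass in are hypothetical, and a real MTA or IMAP server does the delivery and the move for you.

```python
import mailbox
import os


def deliver_and_read(maildir_path, raw_message):
    """Deliver one message into new/, then mark it read by moving it to cur/.

    Returns (files seen in new/ right after delivery, files in cur/ at the end).
    """
    # create=True makes the mailbox directory with its tmp/, new/ and cur/ subdirectories.
    box = mailbox.Maildir(maildir_path, create=True)
    key = box.add(raw_message)  # freshly added messages land in new/
    in_new = len(os.listdir(os.path.join(maildir_path, "new")))

    msg = box[key]              # a mailbox.MaildirMessage
    msg.set_subdir("cur")       # cur/ holds messages the client has already seen
    msg.add_flag("S")           # "S" is the Maildir flag for "seen"
    box[key] = msg              # rewrites the message file under cur/
    box.close()

    in_cur = len(os.listdir(os.path.join(maildir_path, "cur")))
    return in_new, in_cur
```

Because every message is a separate file, delivery is a file create plus a rename rather than an append to one ever-growing mbox file.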
Just a thought ... :)
Since 5.0 it can (Score:3, Informative)
Re:Yet another reason for.. (Score:3, Informative)
Lotus Notes (5.0.5), as installed on my system, is 127M (no modem files etc.) with 59M in help.nsf files, and my .NSF file and templates are a hair over 12M. MS Office is over 160M, without PPT, and that's just the Program Files\Microsoft Office directory.
Lotus Notes is pretty clean, so most of its files are in one directory, not spread out over umpteen directories like Office.
News for Nerds (Score:2, Informative)
Why are there so many people complaining about a PS link? The answer is simple: /. is news for nerds, not for geeks.
So while the average geek keeps his favorite PostScript viewer handy, the standard nerd wonders about such an ancient format and does not know how to feed it to his Acrobat viewer...
Here is the solution for those irritated ones: try this [wisc.edu] piece of ancient software on the ancient Adobe format, and you can miraculously view its contents!
Have fun and keep your google handy!
VM & EMACS (Score:3, Informative)
Re:Procmail and Glimpse (Score:3, Informative)
- install SpamAssassin, and add it to your procmail rules to sort spam hits into a "spam" folder.
- Install a nice IMAP system, like courier-imap.
- install apache/php, get ssl to work with a "fake" certificate.
- install squirrelmail (squirrelmail.org) and point it at your imap instance to get nice, easy encrypted webmail from anywhere.
- get a dyndns domain, or buy one and have zoneedit.com host it for free (with dynamic ip support)
- ssh in for old-school mutt action, or use the webmail and its built-in search function.
- for extra credit, add fetchmail to fetch in all your various other accounts into this system, sorted by procmail into the right place.
- for extra extra credit, hack gotmail to fetch your hotmail, spamfilter it, and sort it.
I've done all this, it works, and it ROCKS.
Re:What I want (Score:5, Informative)
(jfb)
Done already (Score:5, Informative)
Been done already. Check out Mail::Miner [cpan.org].
finding NEW topics (Score:2, Informative)
The main result in Kleinberg's paper relates to finding NEW topics that start to appear in the stream. Let's say you already have categorization filters (procmail, keyword filters, your own set of folder hierarchies, whatever...), but there's a new topic that starts showing up in your mail, or in your newsgroup feed, or on CNN. Kleinberg's result is a way to find that the new stuff really is NEW, so that you might want to group it together and make a folder for it. You could do that automatically, or by hand, but first you have to know that there's a topic.
There's a bunch of other work in this area, what the NLP types call TDT -- "Topic Detection and Tracking" [google.com].
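To give a feel for what "finding that the new stuff is NEW" can look like in code, here is a minimal Python sketch of a two-state burst detector over message arrival times. It is my own simplified reading of the idea (a baseline state and a faster "burst" state, with a cost for switching into the burst state), not the paper's full method. The function name `detect_bursts` is made up, and the default values for `scale` and `gamma` are assumptions picked for illustration.

```python
import math


def detect_bursts(arrival_times, scale=2.0, gamma=1.0):
    """Label each gap between consecutive arrivals 0 (baseline) or 1 (burst).

    Assumed model: gaps are exponential with rate n/T in the baseline state
    and scale * n/T in the burst state; entering the burst state costs
    gamma * ln(n), leaving it is free. The cheapest labelling is found by
    dynamic programming (Viterbi-style).
    """
    gaps = [b - a for a, b in zip(arrival_times, arrival_times[1:])]
    n = len(gaps)
    if n == 0:
        return []
    total = sum(gaps)
    if total <= 0:
        raise ValueError("arrival times must span a positive interval")

    base_rate = n / total
    rates = (base_rate, scale * base_rate)
    up_cost = gamma * math.log(n)

    def gap_cost(state, gap):
        # Negative log-likelihood of the gap under an exponential density.
        return rates[state] * gap - math.log(rates[state])

    cost = [0.0, math.inf]  # the stream starts in the baseline state
    back = []               # back[i][s] = best previous state for gap i in state s
    for gap in gaps:
        new_cost, pointers = [], []
        for state in (0, 1):
            options = []
            for prev in (0, 1):
                switch = up_cost if (prev == 0 and state == 1) else 0.0
                options.append((cost[prev] + switch, prev))
            best, prev = min(options)
            new_cost.append(best + gap_cost(state, gap))
            pointers.append(prev)
        cost = new_cost
        back.append(pointers)

    # Walk the back-pointers from the cheaper final state.
    state = 0 if cost[0] <= cost[1] else 1
    labels = []
    for pointers in reversed(back):
        labels.append(state)
        state = pointers[state]
    labels.reverse()
    return labels
```

You would feed it the timestamps of the messages that mention some word or come from some thread; a run of 1s in the output marks a stretch where those messages arrive much faster than their overall average, which is a hint that a new topic has started.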
Re:procmail! [Re:The ultimate spam blocker?] (Score:4, Informative)
Or you could just use SpamAssassin [taint.org], which is designed specifically to do this and has many more rules that have been created by others.
--Bruce
Not new, but cool. (Score:3, Informative)
There's been lots of work on auto-classifying email. I did my semester project in Machine Learning on this in 1999. It's a fairly simple study, but it seems like a Naive Bayesian classifier using word counts as features does a pretty decent job of classifying email, and does really well on spam.
The paper is here [utexas.edu].
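For the curious, a Naive Bayes classifier over word counts fits in a few lines of Python. This is a generic textbook-style sketch with add-one smoothing, not the code from the project; the class name `NaiveBayes`, the `tokenize` helper and the tokenizing regex are all assumptions made for the example.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into crude word tokens."""
    return re.findall(r"[a-z0-9$%']+", text.lower())


class NaiveBayes:
    """Multinomial Naive Bayes using word counts as features."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.doc_counts = Counter()              # label -> number of training mails
        self.vocab = set()

    def train(self, label, text):
        words = tokenize(text)
        self.word_counts[label].update(words)
        self.doc_counts[label] += 1
        self.vocab.update(words)

    def classify(self, text):
        """Return the most probable label, or None if nothing was trained."""
        words = tokenize(text)
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label, docs in self.doc_counts.items():
            # log prior + sum of log likelihoods, with add-one smoothing
            score = math.log(docs / total_docs)
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            for word in words:
                score += math.log((self.word_counts[label][word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label
```

Train it on a few mails per folder (`nb.train("spam", text)`, `nb.train("ham", text)`) and call `nb.classify(text)` on new mail. The labels do not have to be spam/ham; any set of folder names works.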
J.