Text-Mining Your E-mail 229

Misha writes "There have been a number of weeks/months in anyone's life that called for better organization of your Inbox. Filtering and folders work, but it'd be nice to have a text-mining tool running in the background that categorizes incoming messages by topic as they arrive. It's nice to see that besides NLP research, there are some great algorithmic advances being made, as seen in this paper. Perhaps even one of them Perl monkeys will quickly hack such a background tool." Note: it's a PostScript file.
This discussion has been archived. No new comments can be posted.

  • by Misha ( 21355 ) on Wednesday April 24, 2002 @02:00PM (#3402601) Homepage
  • by Dr Caleb ( 121505 ) on Wednesday April 24, 2002 @02:02PM (#3402623) Homepage Journal
    Lotus Notes.

    It automagically does full text indexing of all specified databases. To it, your Inbox is just another database.

  • by jmb-d ( 322230 ) on Wednesday April 24, 2002 @02:16PM (#3402724) Homepage Journal
    This would be an awesome tool to block spam. If this program could look at the text of an email message and determine that it is a solicitation of some kind

    SpamAssassin [taint.org] will do this part for you.
  • Remembrance Agent (Score:5, Informative)

    by Tekmage ( 17375 ) on Wednesday April 24, 2002 @02:19PM (#3402749) Homepage

    It's more general than e-mail, but in the wearable computing community [blu.org], there's a little application called Remembrance Agent [mit.edu], written by Bradley Rhodes [bradleyrhodes.com] that many folks use. In terms of stand-alone UI, it's still quite primitive, but that's because it was built around dynamic hooks into Emacs.

    I've been playing around with some Java-based wrapper code, to wrap the ra-retrieve executable in a Server and allow clients to access the data via sockets. I have a Java-based client coded up that hooks into the System clipboard, but it's still in alpha-mode. All GPL'd of course, but needs a little time to mature. It's a proof-of-concept, work in progress. :-)

    Check out Brad's site for more insight into the work he did and is doing.

  • by Styx ( 15057 ) on Wednesday April 24, 2002 @02:36PM (#3402872) Homepage
    I use procmail, with weighted scoring [columbia.edu]
    First, I sort out mail from the mailing lists I read.
    Then, mail from friends and people I correspond with a lot.
    Finally, I have a weighted scoring recipe:

    :0 HBh
    * -199^0
    #Assign an initial value of -199; mail is filtered if the score is above 0 at the end of the recipe.
    #H is added to the original "Bh" flags so the From/To conditions are also tested against the header.
    * 50^1 ^(From|To):.*@hotmail.com
    * 50^1 ^(From|To):.*@yahoo.com
    * 50^1 ^(From|To):.*@aol.com
    * 50^1 ^(From|To):.*@msn.com
    * 50^1 ^(From|To):.*@excite.com
    * 50^1 ^(From|To):.*@netscape.net
    * 50^1 ^(From|To):.*@yahoo.co.uk
    #Most mail to and from these domains is spam, so score it.
    * 100^1 opt-out
    * 50^1 opt-in
    * 200^1 OTCBB
    * 50^1 viagra
    * 50^1 zyban
    * 50^1 propecia
    * 75^1 FREE
    * 75^1 GUARANTEED
    * 75^1 LEGAL
    * 50^2 MILLIONAIRE
    * 50^1 100%
    #Words I only see in spam.
    mail/Trash

    This works quite well for me. If any spam gets through, I try to find some words that I don't get in normal mail and add them to the scoring.
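The scoring in the recipe above can be sketched in Python. This is only an approximation of procmail's documented `w^x` weighting, where the n-th match of a pattern adds w * x^(n-1); it is not procmail itself. The names `procmail_score` and `is_spam`, the shortened rule list, and the escaped dots in the domain pattern are assumptions made for illustration.

```python
import re

def procmail_score(text, rules, initial=0):
    """Approximate procmail's w^x scoring.

    Each rule is (weight, exponent, pattern). The n-th match of a
    pattern adds weight * exponent**(n-1) to the running score.
    """
    score = initial
    for weight, exponent, pattern in rules:
        # procmail matches case-insensitively by default, so we do too.
        matches = len(re.findall(pattern, text, flags=re.IGNORECASE | re.MULTILINE))
        for n in range(matches):
            score += weight * exponent ** n
    return score

# A shortened, hypothetical version of the rule list in the recipe.
RULES = [
    (50, 1, r"^(From|To):.*@hotmail\.com"),
    (100, 1, r"opt-out"),
    (50, 1, r"viagra"),
    (75, 1, r"FREE"),
    (50, 2, r"MILLIONAIRE"),
]

def is_spam(text):
    """Start at -199, as in the recipe, and file as spam if the total is above 0."""
    return procmail_score(text, RULES, initial=-199) > 0
```

With an exponent of 1, every occurrence adds the full weight; with an exponent of 2 (as for MILLIONAIRE), repeated occurrences count for progressively more.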

  • Re:What I want (Score:2, Informative)

    by Jobe_br ( 27348 ) <bdruth@gmailCOUGAR.com minus cat> on Wednesday April 24, 2002 @02:38PM (#3402883)

    For your second point:

    An email box that automatically ages the files effectively archiving them. Some of my mail folders/files are huge now and it takes too long to append them when new mail arrives.

    you could switch to using the Maildir format instead of the typical single-file 'mbox' format. Maildir is popularly used by the qmail MTA as well as courier-imap. I run all my email servers in this manner and I've noticed significant speed improvements in mailboxes that have many messages.

    Maildir maintains three directories, of which two are significant: cur and new. Any new message delivered into the Maildir mailbox is placed in the "new" directory; once it's been read, it's moved into the "cur" directory. Each message is its own file, so there is no speed penalty for appending messages to mailboxes with many messages. Of course, all these different directories are transparent to the end user: Maildir-capable MUAs (for console users) and Maildir-capable IMAP/POP systems are freely available (qmail does SMTP plus sendmail wrapping and includes a basic POP3 daemon; courier-imap does IMAPv4, amongst other things; all the apps lend themselves to being used over SSL via stunnel).

    Just a thought ... :)
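The new/cur behaviour described above can be seen with Python's standard `mailbox` module. This is a minimal sketch, assuming a POSIX filesystem (Maildir filenames contain a colon); the function names `count_files` and `demo_maildir` and the sample message are made up for the example.

```python
import mailbox
import os

def count_files(maildir_path, subdir):
    """Count the message files in one Maildir subdirectory."""
    return len(os.listdir(os.path.join(maildir_path, subdir)))

def demo_maildir(root):
    """Deliver one message, then mark it read.

    Returns ((new, cur) before reading, (new, cur) after reading).
    """
    path = os.path.join(root, "Maildir")
    box = mailbox.Maildir(path, create=True)  # creates tmp/, new/ and cur/

    # Delivery: the message becomes its own file under new/.
    key = box.add("From: a@example.org\nSubject: hello\n\nbody\n")
    before = (count_files(path, "new"), count_files(path, "cur"))

    # Reading: move it to cur/ and set the "seen" flag.
    msg = box.get_message(key)
    msg.set_subdir("cur")
    msg.add_flag("S")
    box[key] = msg
    after = (count_files(path, "new"), count_files(path, "cur"))

    box.close()
    return before, after
```

Because delivery only creates a new file, its cost does not grow with the number of messages already in the mailbox.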

  • Since 5.0 it can (Score:3, Informative)

    by barzok ( 26681 ) on Wednesday April 24, 2002 @02:39PM (#3402889)
    Message rules are very easy to set up and manage. No agents.
  • by Dr Caleb ( 121505 ) on Wednesday April 24, 2002 @02:49PM (#3402948) Homepage Journal
    How do you figure that?

    Lotus Notes (5.0.5), as installed on my system, is 127M (no modem files etc.) with 59M in help.nsf files, and my .NSF file and templates are a hair over 12M. MS Office is over 160M, without PPT, and that's just the Program Files\Microsoft Office directory.

    Lotus Notes is pretty clean, so most of its files are in 1 directory, not spread out over umpteen directories like Office.

  • News for Nerds (Score:2, Informative)

    by lydon ( 26705 ) on Wednesday April 24, 2002 @03:05PM (#3403045) Homepage

    Why are there so many people complaining about a PS link? The answer is simple: /. is news for nerds, not for geeks.

    So while the average geek keeps his favorite PostScript viewer handy, the standard nerd wonders about such an ancient format and does not know how to feed his Acrobat viewer with it...

    Here is the solution for those irritated ones: try this [wisc.edu] piece of ancient software on the ancient Adobe format, and you can miraculously view its contents!

    Have fun and keep your google handy!

  • VM & EMACS (Score:3, Informative)

    by pmz ( 462998 ) on Wednesday April 24, 2002 @03:09PM (#3403079) Homepage
    I have enjoyed using the VM module for Emacs. It allows sorting your entire Inbox into separate categorized mail boxes via regular expressions. Basically with one shift-A keystroke, my entire day's worth of mailing list stuff gets whisked away into a half-dozen different files. After this, I feel really sorry for people trapped in the Outlook dungeons!
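The regular-expression sorting described above is not tied to Emacs. As a rough illustration (this is not how VM is implemented), the same idea in Python looks like the sketch below; the folder names, header choices, and patterns in `RULES`, and the function name `pick_folder`, are all hypothetical.

```python
import re
from email import message_from_string

# (folder, header to inspect, regular expression); first match wins.
RULES = [
    ("lists/perl", "List-Id", r"perl"),
    ("lists/emacs", "To", r"emacs-devel@"),
    ("friends", "From", r"@example\.org$"),
]

def pick_folder(raw_message, rules=RULES, default="inbox"):
    """Return the folder of the first rule whose pattern matches its header."""
    msg = message_from_string(raw_message)
    for folder, header, pattern in rules:
        value = msg.get(header, "")
        if re.search(pattern, value, re.IGNORECASE):
            return folder
    return default
```

Running `pick_folder` over every message in an inbox and appending each one to the returned folder gives the one-keystroke sorting effect.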
  • by nehril ( 115874 ) on Wednesday April 24, 2002 @03:28PM (#3403267)
    - Leave your procmail system in place.
    - Install SpamAssassin, and add it to your procmail rules to sort spam hits into a "spam" folder.
    - Install a nice IMAP system, like courier-imap.
    - Install apache/php, get ssl to work with a "fake" certificate.
    - Install squirrelmail (squirrelmail.org) and point it at your imap instance to get nice, easy encrypted webmail from anywhere.
    - Get a dyndns domain, or buy one and have zoneedit.com host it for free (with dynamic ip support).
    - ssh in for old-school mutt action, or use the webmail and its built-in search function.
    - For extra credit, add fetchmail to fetch in all your various other accounts into this system, sorted by procmail into the right place.
    - For extra extra credit, hack gotmail to fetch your hotmail, spamfilter it, and sort it.

    I've done all this, it works, and it ROCKS.

  • Re:What I want (Score:5, Informative)

    by nosferatu-man ( 13652 ) <spamdot@homonculus.net> on Wednesday April 24, 2002 @03:28PM (#3403276) Homepage
    Welcome to Gnus [gnus.org]. Have a sandwich.

    (jfb)
  • Done already (Score:5, Informative)

    by Matts ( 1628 ) on Wednesday April 24, 2002 @03:43PM (#3403452) Homepage
    "Perhaps even one of them Perl monkeys will quickly hack such a background tool."

    Been done already. Check out Mail::Miner [cpan.org].
  • finding NEW topics (Score:2, Informative)

    by tswaterman ( 575957 ) on Wednesday April 24, 2002 @04:47PM (#3404286)
    Many of these comments are missing the point. The paper is not really about categorizing your email.

    The main result in Kleinberg's paper relates to finding NEW topics that start to appear in the stream. Let's say you already have categorization filters (procmail, keyword filters, your own set of folder hierarchies, whatever...), but there's a new topic that starts showing up in your mail, or in your newsgroup feed, or on CNN. Kleinberg's result is a way to find that the new stuff really is NEW, and you might want to group it up together, and make a folder for it. You could do that automatically, or by hand, but first you have to know that there's a topic.

    There's a bunch of other work in this area, what the NLP types call TDT -- "Topic Detection and Tracking" [google.com].
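To make the "know that there's a topic" step concrete, here is a deliberately naive sketch. It is NOT Kleinberg's algorithm or any TDT method; it only flags terms that appear in a much larger share of recent messages than of older ones. The function names and the thresholds `min_count` and `ratio` are assumptions chosen for the example.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase words of three or more letters."""
    return re.findall(r"[a-z]{3,}", text.lower())

def emerging_terms(history, recent, min_count=3, ratio=3.0):
    """Return terms far more common in `recent` messages than in `history`.

    A term qualifies if it occurs in at least `min_count` recent messages
    and its recent per-message rate is at least `ratio` times its
    add-one-smoothed historical rate.
    """
    hist = Counter(w for doc in history for w in set(tokenize(doc)))
    rec = Counter(w for doc in recent for w in set(tokenize(doc)))
    found = []
    for term, count in rec.items():
        if count < min_count:
            continue
        recent_rate = count / len(recent)
        past_rate = (hist[term] + 1) / (len(history) + 1)  # smoothing avoids zero
        if recent_rate >= ratio * past_rate:
            found.append(term)
    return sorted(found)
```

The flagged terms are candidates for a new folder; deciding whether they really form one topic is the harder problem the comment describes.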

  • by bruckie ( 217355 ) <slashdot@brucec.net> on Wednesday April 24, 2002 @05:22PM (#3404627) Homepage

    Or you could just use SpamAssassin [taint.org], which is designed specifically to do this and has many more rules that have been created by others.

    --Bruce

  • Not new, but cool. (Score:3, Informative)

    by jefferson ( 95937 ) on Wednesday April 24, 2002 @05:42PM (#3404809) Homepage

    There's been lots of work on auto-classifying email. I did my semester project in Machine Learning on this in 1999. It's a fairly simple study, but it seems like a Naive Bayesian classifier using word counts as features does a pretty decent job of classifying email, and does really well on spam.

    The paper is here [utexas.edu].

    J.
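For readers who have not seen one, a Naive Bayesian classifier over word counts fits in a few lines. This is a generic multinomial Naive Bayes with add-one smoothing, written as a sketch; it is not the code from the linked project, and the class name, tokenizer, and labels are assumptions for illustration.

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(text):
    """Split text into lowercase word tokens."""
    return re.findall(r"[a-z']+", text.lower())

class NaiveBayes:
    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.doc_counts = Counter()              # label -> number of messages
        self.vocab = set()

    def train(self, text, label):
        words = tokenize(text)
        self.word_counts[label].update(words)
        self.doc_counts[label] += 1
        self.vocab.update(words)

    def classify(self, text):
        """Return the label with the highest log posterior for `text`."""
        total_docs = sum(self.doc_counts.values())
        words = tokenize(text)
        best_label, best_logp = None, -math.inf
        for label, n_docs in self.doc_counts.items():
            logp = math.log(n_docs / total_docs)  # class prior
            total_words = sum(self.word_counts[label].values())
            for word in words:
                count = self.word_counts[label][word]
                # Add-one smoothing so unseen words do not zero out the score.
                logp += math.log((count + 1) / (total_words + len(self.vocab)))
            if logp > best_logp:
                best_label, best_logp = label, logp
        return best_label
```

Training is just counting, which is why this approach is cheap enough to run on every incoming message.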
