Text-Mining Your E-mail

Posted by Hemos
from the doing-the-research dept.
Misha writes "There have been a number of weeks/months in anyone's life that called for a better organization of your Inbox. Filtering and folders work, but it'd be nice to have a text-mining tool running in the background that categorized incoming messages by topic as they arrive. It's nice to see that besides NLP research, there are some great algorithmic advances being done, as seen in this paper. Perhaps even one of them Perl monkeys will quickly hack such a background tool." Note: it's a PostScript file.
This discussion has been archived. No new comments can be posted.

  • by Anonymous Coward
    They'll end up finding a loophole in your filtering, or you'll end up filtering out real emails.

    Only way to win is to kill it from the source. End of story.
    • Doesn't this effectively amount to censorship? I agree that we (as individuals) need to make spamming less cost effective, but just preventing people from emailing (which is what you'd have to do) is censorship, unless of course you're the govt, in which case it's "protecting national interests"...
      • Hoo boy. Here we go again. When are you kids going to get it straight?

        - Choosing not to listen to somebody is *not* censorship.
        - Throwing your mail away before you open it is *not* censorship.
        - Choosing not to relay somebody's spam is *not* censorship.
        - Choosing not to broadcast somebody's TV program, even if you own a TV network, is *not* censorship.
        - Telling a movie producer you won't distribute his/her movie unless he/she makes cuts or changes to the subject matter is *not* censorship.
        - Rallying your church group together to burn books is *not* censorship.
        - Refusing to sell certain magazines or newspapers, if you own a newsstand, is *not* censorship.

        The only way somebody can be truly "censored" is when there is no legal means for that person to get his/her speech/art/etc. produced and disseminated to the public. Generally speaking, the only body with that type of power is the government -- because they make the laws.

        Everything else is merely an inconvenience. It may piss you off, sure, and you may wish things were different. But you can't force people to support you, encourage you, or fund you if they just don't want to. For example, people in this country (the US) *do* have a right to decide what material constitutes pornography, relative to their local community standards -- and if you don't like it, you are within your rights to move to another town.

        "No censorship" does not mean being forced to look at every piece of crap that somebody wants to throw in your face, and god help us if it did.
        • Re:Censorship? (Score:2, Interesting)

          by alouts (446764)
          Very valid points but:

          The post you're ranting against was a reply to one that suggests filtering is not what we should do. That spam needs to be "killed at the source". Which means legally preventing someone from creating any mail in the first place.

          Say what you will about spammers, but that IS censorship.

          ('Course there's plenty of people here who believe that censorship is fine in this case, but that's not what you're arguing, so I won't either.)

          • It is only censorship if "killed at the source" means to literally kill the fucknut sending the spam. Death is the most convenient and widely implemented form of censorship in places like China.

            Preventing someone from sending emails is NEVER censorship by definition. They can always go to Kinkos and make plain old paper mailings and then mail them to everyone on the planet.

            t.

    • Spam filtering is one possible application of this type of tool, but the more useful one involves taking the mail you *do* want and sorting it into logical buckets. For instance, let's say you work on several open source projects, belong to a couple of organizations, and have a real-life job. You could toss a filter in your email that scans each incoming message and throws it in the proper bucket. This allows you to logically separate your mail, reducing confusion between the non-overlapping categories.

      Procmail only goes so far; it's really only useful for simple header scanning. I could really see a good scanner utility being a valuable tool. Maybe Google should share some of their technology. :-)
      • Dynamic folders or views of your email would be a Wonderful Thing.

        I can't say how constraining it is to have statically defined folders which I have to move mail into based on my selection.

        Procmail helps to do this dynamically based on simple criteria, but when you want to have a particular piece of email show up in multiple views without having multiple copies, it really calls for associating named "views" of the whole mess with specific search and sorting criteria.

        That way, one view is "Latest Unread Messages" which has a particular message in it that might also show up in "Most Recent Messages about Project X" and in "Most Recent Messages from Boss".

        I'd love to have my email client show multiple views this way.
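For what it's worth, the "named views" idea above can be sketched in a few lines of Python. This is only an illustration under assumptions: the `Message` fields, the `apply_views` helper, and the boss address are all made up for the example, and no real mail client is being described. Each message is stored once, and a view is just a name attached to a predicate, so the same message can show up in several views without being copied.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Message:
    # Hypothetical minimal message record; a real client would carry far more.
    sender: str
    subject: str
    date: datetime
    unread: bool = True


# A view is a predicate over messages; the messages themselves are stored once.
View = Callable[[Message], bool]


def apply_views(messages: List[Message], views: Dict[str, View]) -> Dict[str, List[Message]]:
    """Return, for each view name, the matching messages, newest first."""
    newest_first = sorted(messages, key=lambda m: m.date, reverse=True)
    return {name: [m for m in newest_first if pred(m)] for name, pred in views.items()}


# View names taken from the comment; the criteria are assumptions.
VIEWS: Dict[str, View] = {
    "Latest Unread Messages": lambda m: m.unread,
    "Most Recent Messages about Project X": lambda m: "project x" in m.subject.lower(),
    "Most Recent Messages from Boss": lambda m: m.sender == "boss@example.com",
}
```

Because the views hold references rather than copies, marking a message as read in one place would change every view that shows it.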

  • What I want (Score:2, Funny)

    by clion999 (565741)
    Here's to the researchers. I would like:

    * An email box that lets me extract the threads with my friends.
    * An email box that automatically ages the files effectively archiving them. Some of my mail folders/files are huge now and it takes too long to append them when new mail arrives.

    Yes, I realize I should get off my butt and do this, but it's faster to post on slashdot.

    • In fact, at least point 2 can be easily realized using mutt.
    • Use nmh. Messages are stored in separate files rather than an entire folder in one file. You can then auto-archive by date with something like:

      refile `pick +inbox -before '1 apr 2002'` -src +inbox +archive
      • Use nmh. Messages are stored in separate files rather than an entire folder in one file. You can then auto-archive by date with something like: refile `pick +inbox -before '1 apr 2002'` -src +inbox +archive
        Yeah, I was wondering a bit about what "text mining" your email is supposed to be about exactly...

        Personally, I use mh (using the emacs mh-rmail frontend). I refile stuff automatically typically just based on the '-from' (using commands much like the above pick/refile). And if I'm looking for something I remember seeing awhile back, a grep on one or two mail folders (which are just directories full of text files for us mh users) does a pretty good job...

        I won't say that there's no way to improve on this, but any fancy system that someone proposes has got to beat some pretty effective simple tools...

        I mean, if you're really after identifying a burst of activity on a given topic... wouldn't a combination of text searches and visual scans of subject headers sorted by date get you 90% of the way there?

        While we're on the subject, anyone taken a look at this old jwz idea: Intertwingle [mozilla.org]

    • Re:What I want (Score:2, Informative)

      by Jobe_br (27348)

      For your second point:

      An email box that automatically ages the files effectively archiving them. Some of my mail folders/files are huge now and it takes too long to append them when new mail arrives.

      you could switch to using the Maildir format instead of the typical single-file 'mbox' format. Maildir is popularly used by the qmail MTA as well as courier-imap. I run all my email servers in this manner and I've noticed significant speed improvements in mailboxes that have many messages.

      Maildir maintains three directories, of which two are significant: cur and new. Any new message delivered into the Maildir mailbox is placed in the "new" directory; once it's been read, it's moved into the "cur" directory. Each message is its own file, so no speed penalty is incurred for appending messages to mailboxes with many messages. Of course, all these different directories and such are transparent to the end user. Maildir-capable MUAs (for console users) and of course Maildir-capable IMAP/POP systems are freely available (qmail does SMTP+sendmail wrapping and includes a basic POP3 daemon; courier-imap does IMAPv4 amongst other things; all the apps lend themselves to being used in an SSL-via-stunnel environment).

      Just a thought ... :)
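To make the new/cur behaviour concrete, here is a small sketch using Python's standard `mailbox` module. The function names `deliver` and `mark_read` are made up for this example, and it assumes the Maildir path does not exist yet so the module can create the directory layout itself. It is not how qmail or courier-imap do it internally, just the same on-disk convention.

```python
import mailbox


def deliver(maildir_path: str, raw_message: str) -> str:
    """Deliver one message; plain strings land in the new/ subdirectory."""
    # create=True builds tmp/, new/ and cur/ when maildir_path does not exist yet.
    box = mailbox.Maildir(maildir_path, create=True)
    key = box.add(raw_message)
    box.close()
    return key


def mark_read(maildir_path: str, key: str) -> None:
    """Do what an MUA does after reading: flag as seen and move to cur/."""
    box = mailbox.Maildir(maildir_path, create=False)
    msg = box.get_message(key)
    msg.set_subdir("cur")   # read mail lives in cur/
    msg.add_flag("S")       # "S" is the Maildir "seen" flag
    box[key] = msg          # rewrites the single file under its new location
    box.close()
```

Each message stays a separate file throughout, so delivery never has to append to (or lock) one huge mailbox file.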

    • I know this is the wrong place to point this out, but Outlook does what you are asking for.

      You can sort a folder by user/subject/date, and there is a built in thread view. You can also use the autoarchive feature, or manually archive messages in X folder(s) older than Y date.
    • Re:What I want (Score:5, Informative)

      by nosferatu-man (13652) <spamdot@homonculus.net> on Wednesday April 24, 2002 @02:28PM (#3403276) Homepage
      Welcome to Gnus [gnus.org]. Have a sandwich.

      (jfb)
    • Here's what I want:

      A google plug-in for my mail client.

      Thanks in advance!
    • gnus [gnus.org] has been doing this for years... as well as other neat things like mail scoring (similar to news scoring) so that mail you don't want to read gets filtered to the bottom of your list or (if you tell it to) doesn't even show up at all. Similarly, mail that you most want to read (based on past response) gets bubbled up to the top. gnus also supports mail expiry (once again, similar to news) so that old mail gets Handled(TM).
  • by Misha (21355) on Wednesday April 24, 2002 @01:00PM (#3402601) Homepage
    • If you run linux ps2pdf works nicely as well.
    • PS-PDF is great for quickly mirroring webpages. I'm surprised that I don't see more people doing it here on slashdot to get some quick karma when sites get slashdotted. You have the webpage open in your browser (because you got there before the crowd). First you print it to a postscript file (netscape does this nicely). Then you run it through ps2pdf or some other tool like this and you have the webpage (with all the pictures) mirrored in a single file. My friends were doing this on sept 11 when all the news sites were going down. Anything one of us saw, we all saw.
      • If using Mac OS X, one could do the same thing in any printing-capable browser (or any other program). Use the Print command and click on the "Preview" button in the dialog. This automatically creates a PDF version of the document, which can be saved and uploaded.
    • by Anonymous Coward
      Just use GhostView...
  • I'm sure I'm not alone in saying that having a good history of well-filtered incoming mail, and especially just about all of my Outgoing (Outbox), available for searching is invaluable. My Outbox has been a lifesaver several times when someone claims that they didn't have that (electronic) discussion with me. It's great to quote "in a message sent... ...I asked you to...".
  • That feature in the description is not text mining, just filtering.
  • by Dr Caleb (121505) <thedarkknight@hu ... il.com minus cat> on Wednesday April 24, 2002 @01:02PM (#3402623) Homepage Journal
    Lotus Notes.

    It automagically does full text indexing of all specified databases. To it, your Inbox is just another database.

    • That's not the point. The paper is talking about modeling spikes in topic/content of data streams over time. This is the second layer analysis of the meta-data that gets stored in the database.
    • Upside:

      Lotus Notes does all kinds of things automagically.

      Downside:

      It's _Lotus Notes_, the application that makes Microsoft Office look lean and mean.

      • How do you figure that?

        Lotus Notes (5.0.5), as installed on my system, is 127M (no modem files etc) with 59M in help.nsf files, and my .NSF file and templates are a hair over 12M. MS Office is over 160M, without PPT, and that's just the Program Files\Microsoft Office directory.

        Lotus Notes is pretty clean, so most of its files are in 1 directory, not spread out over umpteen directories like Office.

        • All you have to do is use Lotus Notes for a few days on an aging PowerMac with 8 or 16MB ram, and you'll give up on it forever. You'll also tell everyone you know 1. what a horrible thing Lotus Notes is, and 2. what a horrible thing a Macintosh is.

          Those at Apple responsible for allowing PowerMacs to ship with System 7.5.x and less than 32MB ram should be banned from the industry. When an OS by default takes more ram than a system has, and is coupled with an application like Lotus Notes, which is hungry, nothing good can ever happen.

          This is, IMNSHO, a good part of the reason that so many corporations ditched their Macs in the mid-nineties.
  • This would be an awesome tool to block spam. If this program could look at the text of an email message and determine that it is a solicitation of some kind and then drop it into an email "pit" (you know, a folder mapped to /dev/null), that would make my life a LOT easier...
    • SpamAssassin does this already, using a genetic algorithm.
    • This would be an awesome tool to block spam. If this program could look at the text of an email message and determine that it is a solicitation of some kind

      SpamAssassin [taint.org] will do this part for you.
    • by Styx (15057) on Wednesday April 24, 2002 @01:36PM (#3402872) Homepage
      I use procmail, with weighted scoring [columbia.edu]
      First, I sort out mail from the mailing lists I read.
      Then, mail from friends, and people I correspond with a lot.
      Finally, I have a weighted scoring recipe:

      :0 Bh
      * -199^0
      #Assign an initial value of -199, mail gets filtered, if the score is above 0, at the end of the recipe.
      * 50^1 ^(From|To):.*@hotmail.com
      * 50^1 ^(From|To):.*@yahoo.com
      * 50^1 ^(From|To):.*@aol.com
      * 50^1 ^(From|To):.*@msn.com
      * 50^1 ^(From|To):.*@excite.com
      * 50^1 ^(From|To):.*@netscape.net
      * 50^1 ^(From|To):.*@yahoo.co.uk
      #Most mail to and from these domains is spam, so score it.
      * 100^1 opt-out
      * 50^1 opt-in
      * 200^1 OTCBB
      * 50^1 viagra
      * 50^1 zyban
      * 50^1 propecia
      * 75^1 FREE
      * 75^1 GUARANTEED
      * 75^1 LEGAL
      * 50^2 MILLIONAIRE
      * 50^1 100%
      #Words I only see in spam.
      mail/Trash

      This works quite well for me. If any spam gets through, I try to find some words, that I don't get in normal mail, and add them to the scoring.
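For anyone unfamiliar with the `w^x` notation in the recipe above: as documented in procmailsc(5), the first match of a condition adds w to the score, the second adds w*x, the third w*x*x, and so on, and the recipe fires if the final score is positive. The Python below is only a simplified model of that arithmetic, not a procmail replacement. It uses Python's `re` dialect instead of procmail's, ignores the recipe flags, and the function name and the shortened `RULES` list are assumptions for the example.

```python
import re
from typing import Iterable, Tuple

Rule = Tuple[float, float, str]  # (weight w, exponent x, regular expression)


def procmail_style_score(text: str, rules: Iterable[Rule], initial: float = 0.0) -> float:
    """Sum w * x**k over the k-th match (k = 0, 1, ...) of each rule's pattern."""
    total = initial
    for weight, exponent, pattern in rules:
        # procmail matches case-insensitively by default, so do the same here.
        matches = len(re.findall(pattern, text, flags=re.IGNORECASE | re.MULTILINE))
        total += sum(weight * exponent ** k for k in range(matches))
    return total


# A shortened, hypothetical rule set in the spirit of the recipe.
RULES = [
    (50, 1, r"^(From|To):.*@hotmail\.com"),
    (100, 1, r"opt-out"),
    (50, 2, r"MILLIONAIRE"),
]
```

With an initial value of -199, a message has to collect at least 200 points of "spammy" words before the total goes above 0, which is why a single hit on a 50-point word is not enough to trash anything.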

  • "that categorized incoming messages by topic as they arrive." - you can already sort messages into different folders depending on their topic by setting up rules.
  • by abucior (306728) on Wednesday April 24, 2002 @01:10PM (#3402672)
    Personally, I'd prefer that I simply get less email. The fact that we need NLP tools to pre-screen our email for us just shows how information-overloaded our society has become. What I really need is a tool at the sender's end that can pre-screen my email and tell the sender "Don't send this. He just doesn't care!"
    • There's plenty of information I want to get that I don't want to look at as email.

      For example, I'd like to get messages inviting me to events I'm unlikely to go to, and I'd like to have their dates get marked down so that I can see what is happening on a given day if I feel like doing something.

      I'd like to get new addresses for people, but I want to have my addressbook updated instead of seeing the message.

      It would be really convenient to have software that would figure out this sort of information from a human-readable message, since people are likely to want to send it in natural language (and the message probably includes more information that I might want to see if I decide I care).

  • I can sort reports from devices, co-workers, clients....each goes in its own folder....
  • IMAP [imap.org] (Internet Message Access Protocol) was designed to centralize email information, I believe. If stored/implemented with a database, what more would you need ?

    I think querying through SQL would satisfy most of us.. and be very useful in corporate environments (for example, query all email sent from a user to support), and it's already done by some projects like DBMAIL [freshmeat.net].

    Anybody out there with experience using these ?

    BTW, there's an extensive database of IMAP products [imap.org] including some that make the data accessible via LDAP... hours of fun!
    • by statusbar (314703) <jeffk@statusbar.com> on Wednesday April 24, 2002 @01:48PM (#3402942) Homepage Journal
      DBMAIL looks cool, once it supports postgresql it would be awesome.

      I have been disappointed in general with most SMTP, IMAP and POP servers. A real database is the proper way to do things. Email is my #1 app and I want to do complex queries on my archives.

      So last year I bit the bullet and wrote a 50 line python program which imported all my mbox and Maildir format archives into a simple postgresql database. 600 megs worth over the last 4 years.

      And another simple 50 line php program gives me a web database query interface. It suits my needs now and is much faster than searching through a big (but much much smaller) imap folder with almost every mail program I've tried. With some good design it really shouldn't be too hard to make an industrial strength email database system and I am surprised that it hasn't happened sooner in the open source world.

      I think that direct SQL access to the mail database is preferred over IMAP. SQL gives you more capabilities and I find it less problematic than all the various combinations of IMAP servers and mail programs.

      Jeff
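The original import script isn't posted, so the following is only a guess at the general shape of such a tool, with SQLite swapped in for PostgreSQL so the example needs nothing beyond the Python standard library. The table layout and the `import_mbox` name are assumptions, and multipart messages are deliberately not handled.

```python
import mailbox
import sqlite3


def _text(value):
    """Header values may be missing; keep None so SQL sees NULL."""
    return None if value is None else str(value)


def import_mbox(mbox_path: str, db: sqlite3.Connection) -> int:
    """Copy sender, subject, date and body of every message in an mbox into a table."""
    db.execute(
        "CREATE TABLE IF NOT EXISTS mail (sender TEXT, subject TEXT, date TEXT, body TEXT)"
    )
    box = mailbox.mbox(mbox_path)
    count = 0
    for msg in box:
        payload = msg.get_payload()
        # Multipart payloads are lists of parts; this sketch only stores plain bodies.
        body = payload if isinstance(payload, str) else ""
        db.execute(
            "INSERT INTO mail VALUES (?, ?, ?, ?)",
            (_text(msg["From"]), _text(msg["Subject"]), _text(msg["Date"]), body),
        )
        count += 1
    box.close()
    db.commit()
    return count
```

Once the mail is in a table, "all email sent from a user to support" becomes an ordinary `SELECT ... WHERE sender = ?` query rather than a folder scan.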
  • look (Score:4, Funny)

    by Joe the Lesser (533425) on Wednesday April 24, 2002 @01:14PM (#3402706) Homepage Journal
    Now we all know that most email is delivered promptly by gremlins, but gremlins are hungry and will eat a few bytes here and there.

    They also leave waste in the form of spam.

    So, I propose that we turn to gnomes to deliver the mail instead, as they are much cleaner, and can be satiated by attaching a file like 'Hamburger.txt'.
  • by CaptainPhong (83963) on Wednesday April 24, 2002 @01:15PM (#3402709) Homepage
    I've found the most joy from owning my own domains, and a lot of it has to do with e-mail sorting/filtering as much as the traditional benefits (a permanent www.yourdomain.com web site address and yourname@yourdomain.com e-mail address).

    Every time you sign up for some mailing list or discussion group, create a new e-mail account or alias for just those mailings. Bam, it's automatically sorted out by itself with extreme ease. If you have limited bandwidth (or are checking, say, on your palm) sometimes, just check your important addresses frequently, and reserve your mailing lists for a once-per-day check.

    If some site asks for your e-mail address to download a piece of software, or to register, make up a new alias and give that to them. If you start getting tons of crap at that address, you can just remove that alias, and they get it all bounced back in their stupid spamming faces.

    Give one address to your cow-orkers just for work stuff. Give a different one to your Mom and other techno-nots that blocks all attachments. Give another one to your friends with brains that goes unfiltered. For people you don't want to talk to, give them the address of an autoresponder tied to Eliza [fury.com].

    Be a *Happy Camper* and let your addresses be *Bubbles* and you be just *You*.

  • Remembrance Agent (Score:5, Informative)

    by Tekmage (17375) on Wednesday April 24, 2002 @01:19PM (#3402749) Homepage

    It's more general than e-mail, but in the wearable computing community [blu.org], there's a little application called Remembrance Agent [mit.edu], written by Bradley Rhodes [bradleyrhodes.com] that many folks use. In terms of stand-alone UI, it's still quite primitive, but that's because it was built around dynamic hooks into Emacs.

    I've been playing around with some Java-based wrapper code, to wrap the ra-retrieve executable in a Server and allow clients to access the data via sockets. I have a Java-based client coded up that hooks into the System clipboard, but it's still in alpha-mode. All GPL'd of course, but needs a little time to mature. It's a proof-of-concept, work in progress. :-)

    Check out Brad's site for more insight into the work he did and is doing.

  • by Col. Panic (90528) on Wednesday April 24, 2002 @01:21PM (#3402764) Homepage Journal
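    use strict;
    use warnings;

    my $pr0n   = "adult";
    my $spam   = "viagra";
    my $urgent = "penis enlargement";
    my $dir    = "/home/mail";    # assumes one message per file in here
    my (@MUST_SEE, @RAINY_DAY);
    opendir(my $inbox, $dir) or die "Damn! No fun for me: $!\n";
    my @list = readdir($inbox);
    closedir($inbox);

    foreach my $file (@list) {
        my $path = "$dir/$file";
        next unless -f $path;    # skip ".", ".." and subdirectories
        open(my $fh, '<', $path) or next;
        my $text = do { local $/; <$fh> } // '';    # slurp the whole message
        close($fh);
        if ($text =~ /\Q$spam\E/i)   { unlink($path); next; }
        if ($text =~ /\Q$pr0n\E/i)   { push @MUST_SEE, $path; next; }
        if ($text =~ /\Q$urgent\E/i) { push @RAINY_DAY, $path; next; }
    }
    # or something like that ...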

  • OK, it's not a piece of Linux software, but it is a beautiful idea:

    http://www.creo.com/sixdegrees/
  • Finally... (Score:2, Funny)

    by Aiku1337 (551438)
    Now I can automatically filter my barely-legal porn spam from my anime porn spam. Let's hear it for technology =)
  • Postscript document (Score:3, Interesting)

    by Tim Ward (514198) on Wednesday April 24, 2002 @01:30PM (#3402836) Homepage
    Somewhat to my astonishment when I clicked on the link up popped a box asking me to confirm Postscript Renderer options! I had no idea that I had anything on this box that could read Postscript.

    Some minutes of 100% CPU later up pops a PSP window, with the document rendered in a font about five pixels square. Fair enough, I suppose, for what's basically a photograph editing application.

    But really, how bizarre, posting something in a low level printer file format. We'll have people posting documents in PCL5 next.
    • But really, how bizarre, posting something in a low level printer file format. We'll have people posting documents in PCL5 next.

      What's so strange about it? Postscript has the great advantage that it's actually designed to describe exactly what's on the page. That lets you produce very nicely formatted documents that will render exactly the same way on any computer, which makes it the output format of choice for programs like TeX. It's great because it's easy to print, so people who prefer to see things in dead tree format can do so easily. It can be processed into PDF very easily, too, so people who like PDFs won't have any problems. Sounds like a good choice to me.

      • .ps generally sounds like a good idea to many science-types.

        I think it's rather tiring.

        If I didn't have a full install of Acrobat on my system, I wouldn't have bothered with it. (It configured itself to handle .ps documents by converting them into .pdf.)

        .pdf has been around for as long as the commercial Internet, and is understood by every computer I've used in the past five years. It can be created by innumerable commercial and free (as in beer and as in speech) tools. It can be read by Acrobat reader, a fantastic free (as in beer) tool from Adobe.

        There really are no reasons to publish in .ps other than whim, elitism, or ignorance. All of those being sins in my book.
        • The reason is ignorance but not on the part of the publishers.

          Acrobat is shit.

          ggv will view .ps, .pdf, .ps.bz2, .ps.gz, probably others. Works great. There is no reason to differentiate between any of them. And if you really must ps2pdf works quite well.

          t.

  • by Ezubaric (464724) on Wednesday April 24, 2002 @01:33PM (#3402854) Homepage
    If this works, you could modify the sorting so that e-mails from your higher-ups would get prompt replies even while you're trolling on /.

    Boss emails:

    1) What's the status of your project?
    I just pulled my third all-nighter, sir, but we're always making progress. I would say we're int(date.day()/32*rand(1))% done right now. Go team!

    2) We have a meeting at (time) about (topic).
    Hey - I just got your e-mail. I was on the VP's (topic) steering board last year, so I'm really interested, but I have a conference call at (time)+(0:30), so I'm afraid I can't make it. Could you please send me the minutes if you get a chance, though?

    3) Everything else
    I have pictures. You know of what. Never e-mail me again or they're going to your mother and your spouse.

  • by Splork (13498)
    i used to use glimpse to index my mail (all stored in mh folders; though maildir would also work). very easy searches that way.
  • Procmail and Glimpse (Score:2, Interesting)

    by Aknaton (528294)
    I'm in the middle of setting up a NetBSD box for email archiving and here is how I hope/intend to set up my e-mail:

    1. Sort all incoming mail with Procmail and use Mutt to move remaining unsorted items where they belong.

    2. Run Glimpse periodically via Crontab so that I can easily/quickly do string searches though the resulting mbox files.

    (If anyone sees a problem with this idea, or a way to improve this idea, please reply to this post!)
    • by nehril (115874)
      leave your procmail system in place.
      - install spamassassin, and add it to your procmail rules to sort spam hits into a "spam" folder.
      - Install a nice IMAP system, like courier-imap.
      - install apache/php, get ssl to work with a "fake" certificate.
      - install squirrelmail (squirrelmail.org) and point it at your imap instance to get nice, easy encrypted webmail from anywhere.
      - get a dyndns domain, or buy one and have zoneedit.com host it for free (with dynamic ip support)
      - ssh in for old-school mutt action, or use the webmail and its built-in search function.
      - for extra credit, add fetchmail to fetch in all your various other accounts into this system, sorted by procmail into the right place.
      - for extra extra credit, hack gotmail to fetch your hotmail, spamfilter it, and sort it.

      I've done all this, it works, and it ROCKS.

  • procmail (Score:2, Interesting)

    by CyberBry (196935)
    You can already pretty much do that. I use IMAP and procmail to automagically sort my mail for me into the proper folders.
  • News for Nerds (Score:2, Informative)

    by lydon (26705)

    Why are there so many people complaining about a PS link? The answer is simple: /. is news for nerds, not for geeks.

    So while the average geek keeps his favorite postscript viewer handy, the standard nerd wonders about such an ancient format and does not know how to feed his acrobat viewer with it...

    Here is the solution for those irritated ones: try this [wisc.edu] piece of ancient software on the ancient adobe format, and you can miraculously view its contents!

    Have fun and keep your google handy!

  • VM & EMACS (Score:3, Informative)

    by pmz (462998) on Wednesday April 24, 2002 @02:09PM (#3403079) Homepage
    I have enjoyed using the VM module for Emacs. It allows sorting your entire Inbox into separate categorized mail boxes via regular expressions. Basically with one shift-A keystroke, my entire day's worth of mailing list stuff gets whisked away into a half-dozen different files. After this, I feel really sorry for people trapped in the Outlook dungeons!
    • could you give an example of your vm-auto-folder-alist? I've been using VM for quite awhile but I haven't tried this feature yet. Just curious how to set the variable to something useful.
      • The vm-auto-folder-alist is basically a list of which fields to scan and what to do with classes of entries in those fields. A simple example is:

        (setq vm-auto-folder-alist
              '(("Sender:" ("mailing-list@domain" . "mailing-list.saved")
                           ("mailing-list2@domain" . "mailing-list2.saved"))
                ("From:" ("user@domain" . "user.saved")
                         ("your-e-mail@your-domain" . "sent_mail.saved"))))

        A more powerful example using regular expressions:

        (setq vm-auto-folder-alist '(("From:" ("^.*@dot[.]bomb$" . "dot.bomb.saved"))))

        This will take every e-mail whose From field matches the expression and save it into the file, dot.bomb.saved.

        I think this is by far the most useful and time-saving feature in VM, especially when subscribed to a high-volume mailing list.
  • Done already (Score:5, Informative)

    by Matts (1628) on Wednesday April 24, 2002 @02:43PM (#3403452) Homepage
    "Perhaps even one of them Perl monkeys will quickly hack such a background tool."

    Been done already. Check out Mail::Miner [cpan.org].
  • Once I was at some internet tradeshow in Boston and every other booth seemed to be showing off their e-mail filtering features, each with one or more enormously complicated dialog box. Features! Features! Features!

    My reaction was to want an e-mail reading program that didn't require any filter configuration, though I imagined it would do well to be given a few hints, such as who my boss is, who my mother is, and who my wife is. Other than that, let the program figure it out.

    Imagine the canonical, old-fashioned secretary temp. She ('cause that's what the canonical version was) didn't have to know anything domain-specific to sort the morning mail. Magazines go together, bills go together, personal letters go together, etc.

    I imagine an automated version for my e-mail. Look at who it is "to" (am I on the list?), look at who is "cc"-ed (am I on that list?), look at who it is from (my boss, wife, or mother?), look at who else it is to (boss, wife, or mother?), look at the thread it is part of (is it responding to something I previously wrote?), look at the content (does it mention me, things I have written, my boss, wife, or mother?). Was it sent to a mailing list? Was it written by someone I have explicitly written to (once or many times?)? Was it written by someone who has previously sent me direct e-mail (once or many times?)? Those ideas are just the obvious ones, think of others. Think of more. (Does it talk about sex, credit card merchant accounts, stock tips, or Nigerian money?)

    Now take that and sort it by importance and similarity. Look for a way to present me in a descriptive summary, arranged in a hierarchy with a top-level of, say, 3 to 9 categories, a greatest depth no greater than, say, 4, and keep the sub-branching at intermediate nodes between 3 and 5--but don't max out all those dimensions at once, try to keep the total number of leaf categories to under, say, two dozen. Try to make more important items land higher in the tree and with few siblings, grouped with siblings of similar importance. (Maybe give an importance weight to each e-mail and balance the tree on that scale, that would float e-mails to me from my boss about my mother and wife really high with few siblings.)
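A rough sketch of the "importance weight" part of this wish list, in Python. Everything specific here is an assumption made for illustration: the `Mail` fields, the `importance` function, and especially the weights, which are arbitrary numbers rather than anything tuned or taken from the comment. It only covers the header-level hints (to/cc, sender, thread, mailing list), not content analysis or the tree layout.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Mail:
    # Hypothetical, pre-parsed view of a message.
    sender: str
    to: List[str]
    cc: List[str] = field(default_factory=list)
    in_reply_to_me: bool = False      # part of a thread I wrote in
    via_mailing_list: bool = False


def importance(mail: Mail, me: str, vips: Set[str], known: Set[str]) -> int:
    """Add up hand-picked (arbitrary) weights for the hints listed above."""
    score = 0
    if me in mail.to:
        score += 3                    # addressed to me directly
    elif me in mail.cc:
        score += 1                    # I'm only cc'ed
    if mail.sender in vips:
        score += 5                    # boss, wife, or mother
    if vips & set(mail.to + mail.cc):
        score += 2                    # a VIP is also a recipient
    if mail.in_reply_to_me:
        score += 3
    if mail.sender in known:
        score += 2                    # someone I have written to before
    if mail.via_mailing_list:
        score -= 2
    return score
```

Sorting the inbox by this score, highest first, would float a direct mail from the boss above list traffic; balancing the summary tree on the same number is a separate (and harder) step.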

    This summary needs to be integrated with a complete index of the e-mail so I can see how a message fits into a larger thread, how it fits into previous e-mails.

    I (the user) would need to tell the program when to make me a summary of my e-mail (e-mail reading is different when a lot comes in or just a little), and I want to be able to browse through old summaries, including deciding to see composite summaries of, say, the last several days, a week (or three), month, year, or 400 days.

    So I think it ends up being a 4-part user interface:

    List of summaries (which can be manipulated).

    A given summary.

    Exhaustive thread/date/subject/sender list (analogous to what every e-mail reader seems to have now). Note that this view could effectively be turned into an exhaustive address book. Frequent (favored) correspondents could be highlighted by me for ease in sending a new e-mail, and also to provide importance hints to the program. This is where I might say who my boss/wife/mother is.

    A body of a (or more) specific e-mail being read, written, or old e-mail (sent or received) being reviewed.

    And I could go on, but I won't. If anyone wants to write such a thing and wants to hear more, send me an, um, e-mail.

    -kb, the Kent who has been saving all his e-mail (including spam!) for a year or so, providing plenty of raw material to test any such program.

  • finding NEW topics (Score:2, Informative)

    by tswaterman (575957)
    Many of these comments are missing the point. The paper is not really about categorizing your email.

    The main result in Kleinberg's paper relates to finding NEW topics that start to appear in the stream. Let's say you already have categorization filters (procmail, keyword filters, your own set of folder hierarchies, whatever...), but there's a new topic that starts showing up in your mail, or in your newsgroup feed, or on CNN. Kleinberg's result is a way to find that the new stuff really is NEW, and you might want to group it up together, and make a folder for it. You could do that automatically, or by hand, but first you have to know that there's a topic.
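To give a feel for what "noticing that a topic is new" means, here is a deliberately crude Python sketch. To be clear, this is NOT the method from the paper: it is a simple trailing-average threshold swapped in for illustration. The `bursty_days` name and all three parameters are assumptions. The input is a list of daily counts of messages mentioning some term.

```python
from typing import List, Sequence


def bursty_days(
    counts: Sequence[int],
    window: int = 7,
    factor: float = 3.0,
    min_count: int = 3,
) -> List[int]:
    """Return indices of days whose count is at least `factor` times the trailing mean.

    counts[i] is the number of messages mentioning the term on day i.
    Days before a full `window` of history exists are never flagged.
    """
    hits = []
    for day in range(window, len(counts)):
        baseline = sum(counts[day - window:day]) / window
        # min_count stops one or two stray mentions from counting as a burst.
        if counts[day] >= min_count and counts[day] >= factor * baseline:
            hits.append(day)
    return hits
```

A threshold like this is easily fooled by noisy streams, which is part of why a more principled model of bursts is interesting in the first place.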

    There's a bunch of other work in this area, what the NLP types call TDT -- "Topic Detection and Tracking" [google.com]

  • jwz [about] of Mozilla/Netscape fame has a hypothetical program called Intertwingle [mozilla.org] which is (Score:5,Interesting) ....
  • Not new, but cool. (Score:3, Informative)

    by jefferson (95937) on Wednesday April 24, 2002 @04:42PM (#3404809) Homepage

    There's been lots of work on auto-classifying email. I did my semester project in Machine Learning on this in 1999. It's a fairly simple study, but it seems like a Naive Bayesian classifier using word counts as features does a pretty decent job of classifying email, and does really well on spam.
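For readers who haven't seen one, a Naive Bayes classifier over word counts fits in a page of Python. The sketch below is a generic multinomial version with add-one smoothing, written for illustration; it is not the code from the semester project, and the `tokenize` rule and class name are assumptions.

```python
import math
import re
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Optional, Tuple


def tokenize(text: str) -> List[str]:
    """Lowercase and split into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


class NaiveBayes:
    def __init__(self) -> None:
        self.word_counts: Dict[str, Counter] = defaultdict(Counter)  # label -> word -> count
        self.doc_counts: Counter = Counter()                         # label -> number of messages
        self.vocab: set = set()

    def train(self, examples: Iterable[Tuple[str, str]]) -> None:
        """examples is an iterable of (label, message text) pairs."""
        for label, text in examples:
            words = tokenize(text)
            self.doc_counts[label] += 1
            self.word_counts[label].update(words)
            self.vocab.update(words)

    def classify(self, text: str) -> Optional[str]:
        """Return the label with the highest log posterior, or None if untrained."""
        total_docs = sum(self.doc_counts.values())
        words = tokenize(text)
        best_label, best_logp = None, -math.inf
        for label, n_docs in self.doc_counts.items():
            logp = math.log(n_docs / total_docs)          # log prior
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            for w in words:
                # Add-one smoothing so unseen words don't zero out the product.
                logp += math.log((self.word_counts[label][w] + 1) / denom)
            if logp > best_logp:
                best_label, best_logp = label, logp
        return best_label
```

Training on a handful of messages per folder and calling `classify` on each new arrival is the whole workflow; the "naive" part is the assumption that words occur independently given the folder.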

    The paper is here [utexas.edu].

    J.

  • There are plenty of e-mail mining tools in development. This particular work takes one particular approach to mining the data. Whether this approach will turn out to be useful remains to be seen.
