Text-Mining Your E-mail

Misha writes "There have been a number of weeks/months in anyone's life that called for better organization of your Inbox. Filtering and folders work, but it'd be nice to have a text-mining tool running in the background that categorizes incoming messages by topic as they arrive. It's nice to see that besides NLP research, there are some great algorithmic advances being made, as seen in this paper. Perhaps even one of them Perl monkeys will quickly hack up such a background tool." Note: it's a PostScript file.
This discussion has been archived. No new comments can be posted.

  • by Anonymous Coward on Wednesday April 24, 2002 @01:58PM (#3402583)
    They'll end up finding a loophole in your filtering, or you'll end up filtering out real emails.

    Only way to win is to kill it from the source. End of story.
  • Postscript document (Score:3, Interesting)

    by Tim Ward ( 514198 ) on Wednesday April 24, 2002 @02:30PM (#3402836) Homepage
    Somewhat to my astonishment when I clicked on the link up popped a box asking me to confirm Postscript Renderer options! I had no idea that I had anything on this box that could read Postscript.

    Some minutes of 100% CPU later up pops a PSP window, with the document rendered in a font about five pixels square. Fair enough, I suppose, for what's basically a photograph editing application.

    But really, how bizarre, posting something in a low level printer file format. We'll have people posting documents in PCL5 next.
  • Procmail and Glimpse (Score:2, Interesting)

    by Aknaton ( 528294 ) on Wednesday April 24, 2002 @02:42PM (#3402906)
    I'm in the middle of setting up a NetBSD box for email archiving and here is how I hope/intend to set up my e-mail:

    1. Sort all incoming mail with Procmail and use Mutt to move remaining unsorted items where they belong.

    2. Run Glimpse periodically via Crontab so that I can easily/quickly do string searches through the resulting mbox files.

    (If anyone sees a problem with this idea, or a way to improve this idea, please reply to this post!)
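For anyone who wants to try the search half of this plan before installing anything, here is a rough sketch in Python. It is not Glimpse and builds no index: it is a plain linear scan over mbox files using the standard `mailbox` module, so it will be slow on big archives. The function names (`message_text`, `search_mboxes`) and the sample file name are made up for illustration.

```python
import mailbox
import os
import tempfile


def message_text(msg):
    """Return the first text/plain part, decoded as UTF-8 with replacement."""
    for part in msg.walk():
        if part.get_content_type() == "text/plain":
            payload = part.get_payload(decode=True)
            if payload is not None:
                return payload.decode("utf-8", errors="replace")
    return ""


def search_mboxes(paths, needle):
    """Case-insensitive substring search over subjects and bodies.

    Returns a list of (mbox_path, subject) pairs, one per matching message.
    """
    needle = needle.lower()
    hits = []
    for path in paths:
        # create=False: fail loudly instead of creating an empty mbox.
        box = mailbox.mbox(str(path), create=False)
        try:
            for _key, msg in box.iteritems():
                subject = str(msg.get("Subject", ""))
                if needle in subject.lower() or needle in message_text(msg).lower():
                    hits.append((str(path), subject))
        finally:
            box.close()
    return hits


if __name__ == "__main__":
    # Build a throwaway mbox so the sketch can be run without real mail.
    demo_path = os.path.join(tempfile.mkdtemp(), "demo.mbox")
    demo = mailbox.mbox(demo_path)
    demo.add("From: a@example.com\nSubject: Backup report\n\nnightly backup finished\n")
    demo.flush()
    demo.close()
    print(search_mboxes([demo_path], "backup"))
```

The point of an indexer like Glimpse is precisely to avoid this kind of full scan, so treat the sketch as a way to check what is in the sorted mbox files, not as a replacement for the cron job.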
  • procmail (Score:2, Interesting)

    by CyberBry ( 196935 ) on Wednesday April 24, 2002 @02:46PM (#3402929) Homepage
    You can already pretty much do that. I use IMAP and procmail to automagically sort my mail for me into the proper folders.
  • by statusbar ( 314703 ) <jeffk@statusbar.com> on Wednesday April 24, 2002 @02:48PM (#3402942) Homepage Journal
    DBMAIL looks cool; once it supports postgresql it would be awesome.

    I have been disappointed in general with most SMTP, IMAP and POP servers. A real database is the proper way to do things. Email is my #1 app and I want to do complex queries on my archives.

    So last year I bit the bullet and wrote a 50-line python program which imported all my mbox and Maildir format archives into a simple postgresql database: 600 megs' worth over the last 4 years.

    And another simple 50 line php program gives me a web database query interface. It suits my needs now and is much faster than searching through a big (but much much smaller) imap folder with almost every mail program I've tried. With some good design it really shouldn't be too hard to make an industrial strength email database system and I am surprised that it hasn't happened sooner in the open source world.

    I think that direct SQL access to the mail database is preferred over IMAP. SQL gives you more capabilities and I find it less problematic than all the various combinations of IMAP servers and mail programs.

    Jeff
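The importer described above was not posted, so what follows is only a guess at its general shape, not the original program. It swaps in SQLite (via Python's standard `sqlite3`) for postgresql so it runs without a server, handles mbox only (not Maildir), and uses a table layout and function names (`create_schema`, `import_messages`, `import_mbox`, `search`) invented for this sketch.

```python
import email
import mailbox
import sqlite3


def create_schema(conn):
    """Create a single flat table; the column choice is an assumption."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS mail ("
        "id INTEGER PRIMARY KEY, sender TEXT, subject TEXT, date TEXT, body TEXT)"
    )


def body_text(msg):
    """Return the first text/plain part, decoded as UTF-8 with replacement."""
    for part in msg.walk():
        if part.get_content_type() == "text/plain":
            payload = part.get_payload(decode=True)
            if payload is not None:
                return payload.decode("utf-8", errors="replace")
    return ""


def import_messages(conn, messages):
    """Insert an iterable of email messages; returns how many were stored."""
    count = 0
    for msg in messages:
        conn.execute(
            "INSERT INTO mail (sender, subject, date, body) VALUES (?, ?, ?, ?)",
            (
                str(msg.get("From", "")),
                str(msg.get("Subject", "")),
                str(msg.get("Date", "")),
                body_text(msg),
            ),
        )
        count += 1
    conn.commit()
    return count


def import_mbox(conn, path):
    """Import every message from one mbox file."""
    box = mailbox.mbox(path, create=False)
    try:
        return import_messages(conn, box)
    finally:
        box.close()


def search(conn, term):
    """Substring query on subject or body ('%' and '_' in term act as wildcards)."""
    pattern = f"%{term}%"
    return conn.execute(
        "SELECT sender, subject, date FROM mail "
        "WHERE subject LIKE ? OR body LIKE ? ORDER BY id",
        (pattern, pattern),
    ).fetchall()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    create_schema(conn)
    sample = email.message_from_string(
        "From: jeff@example.com\nSubject: test\n\nhello archive\n"
    )
    import_messages(conn, [sample])
    print(search(conn, "archive"))
```

Once the mail is in a table, the "complex queries" are ordinary SQL: the `search` helper is just one `SELECT`, and anything else (counts per sender, date ranges) can be run directly against the `mail` table.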
  • by cheesyfru ( 99893 ) on Wednesday April 24, 2002 @02:51PM (#3402958) Homepage
    Spam filtering is one possible application of this type of tool, but the more useful one involves taking the mail you *do* want and sorting it into logical buckets. For instance, let's say you work on several open source projects, belong to a couple of organizations, and have a real-life job. You could toss a filter in your email that scans each incoming message and throws it in the proper bucket. This lets you logically separate your mail and keep the non-overlapping categories from getting confused.

    Procmail only goes so far; it's really only useful for simple header scanning. I could really see a good scanner utility being a valuable tool. Maybe Google should share some of their technology. :-)
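As a toy illustration of the bucket idea, here is a minimal sketch that scores a message against hand-written keyword lists and picks the best-matching bucket. This is plain keyword counting, not the algorithm from the linked paper, and the bucket names, keyword sets, and `classify` function are all invented for the example.

```python
import re
from collections import Counter

# Hypothetical buckets and keywords; a real setup would tune or learn these.
BUCKETS = {
    "open-source": {"patch", "commit", "cvs", "release", "bug"},
    "organizations": {"meeting", "newsletter", "membership", "dues"},
    "work": {"deadline", "invoice", "schedule", "client"},
}


def tokenize(text):
    """Lowercase the text and split it into runs of ASCII letters."""
    return re.findall(r"[a-z]+", text.lower())


def classify(subject, body, buckets=BUCKETS, default="unsorted"):
    """Return the bucket whose keywords occur most often in the message.

    Falls back to `default` when no keyword matches at all. Ties go to the
    bucket listed first.
    """
    counts = Counter(tokenize(subject + " " + body))
    scores = {
        name: sum(counts[word] for word in words)
        for name, words in buckets.items()
    }
    if not scores:
        return default
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else default


if __name__ == "__main__":
    print(classify("Re: patch for bug 123", "I committed the patch to cvs"))
    print(classify("hello", "how are you"))
```

Unlike a header-only rule, this looks at the body text, which is the part the comment argues procmail handles poorly. A text-mining tool would replace the fixed keyword sets with something learned from already-sorted mail.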
  • by Anonymous Coward on Wednesday April 24, 2002 @03:29PM (#3403287)
    I also use this technique for my externally hosted domain...I get all the mail addressed to any user in the domain, but it's easy to set up mail client filters to remove those which are addressed To:, say, potentialspammer1@mydomain.

    So, if there's any possibility of SPAM, I just invent a new user. Unfortunately, I didn't figure this out quite soon enough and I have some users which get spam and real mail, which I can't afford to filter to trash - people buying their own domains (come on, it's like $15 a year) should be thinking ahead.

    Also, it's not as neat a setup as having my own POP server bounce back the message (which might mean you get off the spam list one day!). More importantly, filtering the To: field doesn't help me most times, since spammers set the To: to "READTHIS" and use Bcc: for their spammies (is that a word!).

    ALSO

    Here's an unrelated question for anyone else who owns a domain like me, where they get a catch all POP box.

    How do you guys make sure people USE your nice domain name?

    In other words, it's okay having a POP box, mail.mydomain.com, but you never seem to get offered the services of an SMTP server through which you can send your messages From: this nice address.

    I would hazard that most people rely on Reply-To:, which is all very well, except that not all mail clients respect it, and you may want to entirely obscure the actual From:.

    Of course, mail clients like Emacs and Mozilla make it easy to arbitrarily set your From:, however you then have to get this through whatever SMTP server you have available (and in order to block spammers and other pranksters, you will increasingly find that most will only send mail if the From: agrees with your user name).

    One of the reasons I moved to linux was so I could run sendmail and not rely on other people's SMTP servers. This is okay at work, since we have direct internet access, but from home when I dial up, it doesn't work.

    I don't think my ISP likes to have people sending mail from their own computers. I get name resolution errors from sendmail when attempting to send email (but have no problem with DNS for web), so I think that perhaps the ISP's DNS servers refuse to give up MX records.

    Anyone else in a similar boat?
  • Re:Censorship? (Score:2, Interesting)

    by alouts ( 446764 ) on Wednesday April 24, 2002 @03:42PM (#3403435)
    Very valid points but:

    The post you're ranting against was a reply to one that suggests filtering is not what we should do. That spam needs to be "killed at the source". Which means legally preventing someone from creating any mail in the first place.

    Say what you will about spammers, but that IS censorship.

    ('Course there's plenty of people here who believe that censorship is fine in this case, but that's not what you're arguing, so I won't either.)
