Text-Mining Your E-mail 229
Misha writes "There have been a number of weeks/months in anyone's life that called for a better organization of your Inbox. Filtering and folders work, but it'd be nice to have a text-mining tool running in the background that categorized incoming messages by topic as they arrive. It's nice to see that besides NLP research, there are some great algorithmic advances being made, as seen in this paper. Perhaps even one of them Perl monkeys will quickly hack such a background tool." Note: it's a PostScript file.
there is no way to win... (Score:1, Interesting)
Only way to win is to kill it from the source. End of story.
Postscript document (Score:3, Interesting)
Some minutes of 100% CPU later up pops a PSP window, with the document rendered in a font about five pixels square. Fair enough, I suppose, for what's basically a photograph editing application.
But really, how bizarre, posting something in a low level printer file format. We'll have people posting documents in PCL5 next.
Procmail and Glimpse (Score:2, Interesting)
1. Sort all incoming mail with Procmail and use Mutt to move remaining unsorted items where they belong.
2. Run Glimpse periodically via cron so that I can easily/quickly do string searches through the resulting mbox files.
(If anyone sees a problem with this idea, or a way to improve this idea, please reply to this post!)
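For anyone without Procmail and Glimpse at hand, the two steps above can be roughly sketched in Python with only the standard library. This is an illustration under stated assumptions, not a replacement for either tool: the `RULES` table, the folder names, and the addresses are made up, and `search` is a plain linear scan, not the index Glimpse builds.

```python
import mailbox
import os

# Hypothetical rules: (header name, substring to look for, destination folder).
RULES = [
    ("List-Id", "mutt-users", "mutt"),
    ("From", "boss@example.com", "work"),
]


def pick_folder(msg, rules=RULES, default="unsorted"):
    """Return the folder of the first rule whose header contains the substring."""
    for header, needle, folder in rules:
        value = str(msg.get(header, ""))
        if needle.lower() in value.lower():
            return folder
    return default


def sort_inbox(inbox_path, out_dir, rules=RULES):
    """Copy each inbox message into out_dir/<folder>.mbox; return per-folder counts."""
    boxes = {}
    counts = {}
    for msg in mailbox.mbox(inbox_path):
        folder = pick_folder(msg, rules)
        if folder not in boxes:
            boxes[folder] = mailbox.mbox(os.path.join(out_dir, folder + ".mbox"))
        boxes[folder].add(msg)
        counts[folder] = counts.get(folder, 0) + 1
    for box in boxes.values():
        box.flush()
        box.close()
    return counts


def search(out_dir, needle):
    """Linear substring scan over every .mbox file; return (file name, subject) pairs."""
    hits = []
    for name in sorted(os.listdir(out_dir)):
        if not name.endswith(".mbox"):
            continue
        for msg in mailbox.mbox(os.path.join(out_dir, name)):
            text = msg.as_bytes().decode("utf-8", errors="replace")
            if needle.lower() in text.lower():
                hits.append((name, str(msg.get("Subject", ""))))
    return hits
```

Anything that matches no rule lands in `unsorted.mbox`, which plays the part of the leftovers moved by hand in Mutt.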
procmail (Score:2, Interesting)
Re:What's wrong with IMAP ? (Score:5, Interesting)
I have been disappointed in general with most SMTP, IMAP and POP servers. A real database is the proper way to do things. Email is my #1 app and I want to do complex queries on my archives.
So last year I bit the bullet and wrote a 50-line Python program which imported all my mbox and Maildir format archives into a simple PostgreSQL database. 600 megs worth over the last 4 years.
And another simple 50-line PHP program gives me a web database query interface. It suits my needs now and is much faster than searching through a big (but much, much smaller) IMAP folder with almost every mail program I've tried. With some good design it really shouldn't be too hard to make an industrial-strength email database system, and I am surprised that it hasn't happened sooner in the open source world.
I think that direct SQL access to the mail database is preferred over IMAP. SQL gives you more capabilities and I find it less problematic than all the various combinations of IMAP servers and mail programs.
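A minimal sketch of the import step, for illustration only. It is not the original 50-line program: it uses the standard library's `sqlite3` as a stand-in for PostgreSQL so that it runs without a server, it handles mbox only (no Maildir), and the `mail` table and its column names are assumptions.

```python
import mailbox
import sqlite3


def _text(value):
    """Header values may be None or Header objects; normalise them to str or None."""
    return None if value is None else str(value)


def import_mbox(mbox_path, conn):
    """Copy every message in an mbox file into the `mail` table; return the count."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS mail ("
        "id INTEGER PRIMARY KEY, sender TEXT, recipient TEXT, "
        "subject TEXT, sent TEXT, body TEXT)"
    )
    count = 0
    for msg in mailbox.mbox(mbox_path):
        # Only single-part bodies are stored here; multipart messages get an empty body.
        payload = None if msg.is_multipart() else msg.get_payload(decode=True)
        body = payload.decode("utf-8", errors="replace") if payload else ""
        conn.execute(
            "INSERT INTO mail (sender, recipient, subject, sent, body) "
            "VALUES (?, ?, ?, ?, ?)",
            (_text(msg["From"]), _text(msg["To"]), _text(msg["Subject"]),
             _text(msg["Date"]), body),
        )
        count += 1
    conn.commit()
    return count
```

Once the rows are in, a query such as `SELECT sender, subject FROM mail WHERE body LIKE '%procmail%'` is the kind of search that is awkward over IMAP.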
Jeff
That's not necessarily the point.. (Score:2, Interesting)
Procmail only goes so far; it's really only useful for simple header scanning. I could really see a good scanner utility being a valuable tool. Maybe Google should share some of their technology.
I do this too + questions for other domain owners (Score:1, Interesting)
So, if there's any possibility of spam, I just invent a new user. Unfortunately, I didn't figure this out quite soon enough and I have some users that get both spam and real mail, which I can't afford to filter to trash. People buying their own domains (come on, it's like $15 a year) should be thinking ahead.
Also, it's not as neat a setup as having my own POP server bounce back the message (which might mean you get off the spam list one day!). More importantly, filtering the To: field doesn't help me most times, since spammers set the To: to "READTHIS" and use Bcc: for their spammies (is that a word?).
ALSO
Here's an unrelated question for anyone else who owns a domain like me, where they get a catch all POP box.
How do you guys make sure people USE your nice domain name?
In other words, it's okay having a POP box, mail.mydomain.com, but you never seem to get offered the services of an SMTP server through which you can send your messages From: this nice address.
I would hazard that most people rely on Reply-To:, which is all very well, except that not all mail clients respect it, and you may want to entirely obscure the actual From:.
Of course, mail clients like Emacs and Mozilla make it easy to arbitrarily set your From:. However, you then have to get this through whatever SMTP server you have available (and in order to block spammers and other pranksters, you will increasingly find that most will only send mail if the From: agrees with your user name).
One of the reasons I moved to Linux was so I could run sendmail and not rely on other people's SMTP servers. This is okay at work, since we have direct internet access, but from home when I dial up, it doesn't work.
I don't think my ISP likes to have people sending mail from their own computers. I get name resolution errors from sendmail when attempting to send email (but have no problem with DNS for web), so I think that perhaps the ISP's DNS servers refuse to give up MX records.
Anyone else in a similar boat?
Re:Censorship? (Score:2, Interesting)
The post you're ranting against was a reply to one that suggests filtering is not what we should do, and that spam needs to be "killed at the source". Which means legally preventing someone from creating any mail in the first place.
Say what you will about spammers, but that IS censorship.
('Course there's plenty of people here who believe that censorship is fine in this case, but that's not what you're arguing, so I won't either.)