Re: Anyone want to write an "intelligent" mail filter?
- Posted by Robert Craig <rds at RapidEuphoria.com> Nov 05, 2003
Irv Mullins wrote:
> Every day I get more annoying SPAM e-mails. Currently it's running about 10
> spams to every valid e-mail.
>
> I'm tired of wading thru them, and I'd rather not download them at all.
> My e-mail client can filter the messages by sender or subject, but most
> spams now are written to get around those filters.
>
> One thing I notice is that nearly 100% of the spams either contain the
> word "lagos" or long strings of "dictionary" words to confuse the filters:
>
> "indecisive constitute dakar summitry ajax beaver descendent withal
> circumlocution asocial voluble inquire convolution replete hitler
> commendation segregate cognition abstract eject disgustful"
>
> But very few or none of the more common shorter words that would likely
> appear in a valid e-mail: "a, and, or, if, you, we, I, to, for, the, this,
> that....."
>
> We should be able to come up with a routine which would analyze a given
> text string and rank it according to its likelihood of being a 'meaningful'
> message. Then use that routine in an e-mail client to rank messages and
> only download from the server those which appear to be 'real'.
>
> Ideas?

For the past few months I've been using the e-mail client in Netscape 7.1. It has a "Bayesian" spam filter that adapts to the streams of spam and normal mail that you receive. It works pretty well.

It keeps track of all the words in your incoming e-mail, and notes how often each word appears in spam vs. normal mail. For example, the word "Euphoria" might have appeared in 1 of my spam messages and 99 of my normal messages, so if it sees "Euphoria" in a message, that would indicate a 99% probability that this is a normal message.

But it doesn't just look at one word. I believe it looks at the 20 or so words in each message with the most extreme probabilities. It uses a formula from Bayesian statistics to combine the probability indicated by each word into a single overall probability. For example, if you had a word that indicated "90% likely to be spam" and another that said "95% likely to be spam", the result of combining those two words might be 97% (or something).

It will move a message out of your inbox into a spam folder if the probability of it being spam is quite high, something like 99%. Obviously you want to keep false positives (real mail tagged as spam) to an absolute minimum.

In practice, over a long period of time, suppose I get 1000 messages, of which 900 are spam. It will probably move about 800 of the spams and 1 or 2 of the non-spams into my spam folder. With each batch of incoming mail, I check the spam folder for non-spams, but usually I can see at a glance from the subjects and senders that there aren't any, so I click a button to delete all the spams in one operation.

Whenever it tags a message incorrectly (usually spam that it missed), you can click a button to tell it so. This way it gradually learns and gets smarter. Also, e-mail from anyone in my address book is automatically considered non-spam, so the false positives are quite low.

Being able to delete a whole bunch of spams in one operation saves time. It's also nice that it keeps my inbox largely clear of distracting spam clutter.

Regards,
   Rob Craig
   Rapid Deployment Software
   http://www.RapidEuphoria.com
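
To make the word-counting and combining steps described in this post a bit more concrete, here is a minimal Python sketch. It is not Netscape's actual code: the function names, the clamping to the 0.01..0.99 range, and the treatment of unseen words as neutral (0.5) are all assumptions made for illustration, and the combining step assumes the common naive-Bayes rule `P = p1*p2*... / (p1*p2*... + (1-p1)*(1-p2)*...)`. The per-word estimate uses raw message counts exactly as in the "Euphoria" example, so it ignores any difference in how much spam and normal mail you receive overall.

```python
import math
from collections import Counter

SPAM_THRESHOLD = 0.99  # "something like 99%", as mentioned in the post
MAX_WORDS = 20         # "the 20 or so words" with the most extreme probabilities


def word_spam_probability(word, spam_counts, ham_counts):
    """Share of the messages containing this word that were spam.

    spam_counts and ham_counts are Counters mapping a word to the number of
    spam / normal messages it has appeared in (hypothetical data layout).
    """
    s, h = spam_counts[word], ham_counts[word]
    if s + h == 0:
        return 0.5  # assumption: a never-seen word carries no evidence
    # Assumption: clamp so one word can never force a result of exactly 0 or 1.
    return min(0.99, max(0.01, s / (s + h)))


def combine(probs):
    """Merge per-word spam probabilities with the naive-Bayes combining rule."""
    spam = math.prod(probs)
    ham = math.prod(1.0 - p for p in probs)
    return spam / (spam + ham)


def spam_score(words, spam_counts, ham_counts, max_words=MAX_WORDS):
    """Overall spam probability for one message, given its list of words."""
    probs = [word_spam_probability(w, spam_counts, ham_counts) for w in set(words)]
    # Keep only the words whose probability is farthest from the neutral 0.5.
    probs.sort(key=lambda p: abs(p - 0.5), reverse=True)
    return combine(probs[:max_words])


def is_spam(words, spam_counts, ham_counts):
    """True if the message would be moved to the spam folder."""
    return spam_score(words, spam_counts, ham_counts) >= SPAM_THRESHOLD
```

Under this particular rule, two words at 90% and 95% combine to roughly 99.4% rather than 97%, because independent pieces of evidence pointing the same way reinforce each other strongly. The "click a button to tell it so" training step would correspond to adding a message's words to `spam_counts` or `ham_counts` after the user corrects a wrong tag.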