Re: Anyone want to write an "intelligent" mail filter?


Irv Mullins wrote:
> Every day I get more annoying SPAM e-mails. Currently it's running about 10 
> spams to every valid e-mail.
> 
> I'm tired of wading thru them, and I'd rather not download them at all. 
> My e-mail client can filter the messages by sender or subject, but most 
> spams now are written to get around those filters. 
> 
> One thing I notice is that nearly 100% of the spams either contain the 
> word "lagos" or long strings of "dictionary" words to confuse the filters:
> 
> "indecisive constitute dakar summitry ajax beaver descendent withal 
> circumlocution asocial voluble inquire convolution replete hitler 
> commendation segregate cognition abstract eject disgustful"
> 
> But very few or none of the more common shorter words that would likely 
> appear in a valid e-mail: "a, and, or, if, you, we, I, to, for, the, this, 
> that....."
> 
> We should be able to come up with a routine which would analyze a given 
> text string and rank it according to its likelihood of being a 'meaningful' 
> message. Then use that routine in an e-mail client to rank messages and 
> only download from the server those which appear to be 'real'. 
> 
> Ideas?

For the past few months I've been using the e-mail
client in Netscape 7.1. It has a "Bayesian" spam filter
that adapts to the streams of spam and normal mail
that you receive. It works pretty well.

It keeps track of all the words in your incoming e-mail,
and notes how often each word appears
in spam vs normal mail. For example, the word "Euphoria"
might have appeared in 1 of my spam messages and
99 of my normal messages, so if it sees "Euphoria" in a 
message, that would indicate a 99% probability that this
is a normal message. But it doesn't just look at one word.
I believe it looks at the 20 or so words in each message
with the most extreme probabilities. It uses a formula from
Bayesian statistics to combine the probability indicated
by each word into a single overall probability. For example,
if you had a word that indicated "90% likely to be spam"
and another that said "95% likely to be spam", the result of 
combining those two words might be 97% or so, higher than
either word on its own.
It will move a message out of your inbox into a spam folder 
if the probability of it being spam is quite high,
something like 99%. Obviously you want to keep false 
positives (real mail tagged as spam) to an absolute minimum.
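As a rough illustration of the scheme described above, here is a minimal Python sketch. It is not Netscape's actual code. The function names, the 0.01/0.99 clamping, and the combining formula (the usual naive Bayes product rule) are assumptions. The per-word estimate is just the raw share of appearances in spam, which ignores any difference in the volume of spam and normal mail. With this particular formula, the 90% and 95% words combine to about 99.4%.

```python
from math import prod

def word_spam_probability(spam_count, ham_count):
    """Share of a word's appearances that were in spam (hypothetical helper)."""
    total = spam_count + ham_count
    if total == 0:
        return 0.5  # unseen word: treated as neutral
    # Clamp so that a single word can never force a result of exactly 0 or 1.
    return min(0.99, max(0.01, spam_count / total))

def most_extreme(probabilities, limit=20):
    """Keep the `limit` probabilities furthest from the neutral value 0.5."""
    return sorted(probabilities, key=lambda p: abs(p - 0.5), reverse=True)[:limit]

def combine(probabilities):
    """Naive Bayes combination of per-word spam probabilities."""
    spam = prod(probabilities)
    ham = prod(1.0 - p for p in probabilities)
    return spam / (spam + ham)

def is_spam(probabilities, threshold=0.99):
    """Flag a message only when the combined probability is very high."""
    return combine(most_extreme(probabilities)) >= threshold

# "Euphoria": seen in 1 spam and 99 normal messages.
euphoria = word_spam_probability(1, 99)   # 0.01, i.e. 99% likely normal
combined = combine([0.90, 0.95])          # about 0.994
```

Note how one strongly "normal" word such as "Euphoria" pulls the combined score well below the threshold even when two other words look spammy.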

In practice, over a long period of time, if I get
1000 messages of which 900 are spam, it will 
probably move about 800 of the spams and 1 or 2 of the 
non-spams into my spam folder.
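To put those figures in terms of rates, here is a small Python sketch. The function and key names are made up for illustration, and it takes the upper figure of 2 misfiled non-spams.

```python
def filter_rates(total, spam, spam_caught, ham_flagged):
    """Summarise filter performance from raw message counts."""
    ham = total - spam
    return {
        # Share of all spam that ends up in the spam folder.
        "spam_caught_rate": spam_caught / spam,
        # Share of real mail wrongly moved to the spam folder.
        "false_positive_rate": ham_flagged / ham,
        # Spam that still lands in the inbox.
        "spam_left_in_inbox": spam - spam_caught,
        # Share of the spam folder that really is spam.
        "spam_folder_precision": spam_caught / (spam_caught + ham_flagged),
    }

rates = filter_rates(total=1000, spam=900, spam_caught=800, ham_flagged=2)
```

With these numbers the filter catches roughly 89% of the spam, misfiles 2% of the real mail, and leaves 100 spams in the inbox alongside 98 real messages.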

With each batch of incoming mail, I check the spam folder 
for non-spams, but usually I can quickly see from the 
subjects and senders that there aren't any non-spams, so I 
click a button to quickly delete all the spams in one 
operation.

Whenever it tags a message incorrectly (usually spam
that it missed), you can click a button to tell it so.
This way it gradually learns and gets smarter.
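One plausible way such a correction button could work is sketched below in Python. This is an assumption about the mechanism, not a description of Netscape's implementation, and the class and method names are hypothetical. The idea is that each message's words are counted under the label it was given, and a correction moves those counts to the other label.

```python
from collections import Counter

class WordCounts:
    """Per-word tallies of how many spam and normal messages contained the word."""

    def __init__(self):
        self.spam = Counter()
        self.ham = Counter()

    @staticmethod
    def words(text):
        # Count each word once per message; crude whitespace tokenising.
        return set(text.lower().split())

    def learn(self, text, is_spam):
        (self.spam if is_spam else self.ham).update(self.words(text))

    def correct(self, text, was_filed_as_spam):
        """Undo an earlier learn() and record the message under the other label."""
        wrong = self.spam if was_filed_as_spam else self.ham
        tokens = self.words(text)
        wrong.subtract(tokens)
        for word in tokens:
            if wrong[word] <= 0:
                del wrong[word]  # drop entries that fell to zero
        self.learn(text, not was_filed_as_spam)

counts = WordCounts()
counts.learn("cheap pills from lagos", is_spam=False)   # a spam the filter missed
counts.correct("cheap pills from lagos", was_filed_as_spam=False)
```

After the correction, "lagos" counts toward spam rather than normal mail, so the next message containing it scores a little higher.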

Also, e-mail from anyone in my address book is automatically
considered non-spam, so the false positives are quite low.
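The address-book rule amounts to a check that runs before the probability test. A minimal Python sketch follows. The function name, the folder labels, and the 0.99 threshold are illustrative assumptions.

```python
def choose_folder(sender, spam_probability, address_book, threshold=0.99):
    """Pick a folder for a message; known senders always go to the inbox."""
    if sender.lower() in address_book:
        return "inbox"  # whitelist wins regardless of the score
    return "spam" if spam_probability >= threshold else "inbox"

# Hypothetical address book, stored in lowercase.
address_book = {"friend@example.com"}
```

For instance, `choose_folder("Friend@example.com", 0.999, address_book)` returns "inbox", while the same score from an unknown sender returns "spam".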

Being able to delete a whole bunch of spams in one
operation saves time. It's also nice that it keeps
my inbox largely clear of distracting spam clutter.

Regards,
    Rob Craig
    Rapid Deployment Software
    http://www.RapidEuphoria.com
