Re: Anyone want to write an "intelligent" mail filter?
Irv Mullins wrote:
> Every day I get more annoying SPAM e-mails. Currently it's running about 10
> spams to every valid e-mail.
>
> I'm tired of wading thru them, and I'd rather not download them at all.
> My e-mail client can filter the messages by sender or subject, but most
> spams now are written to get around those filters.
>
> One thing I notice is that nearly 100% of the spams either contain the
> word "lagos" or long strings of "dictionary" words to confuse the filters:
>
> "indecisive constitute dakar summitry ajax beaver descendent withal
> circumlocution asocial voluble inquire convolution replete hitler
> commendation segregate cognition abstract eject disgustful"
>
> But very few or none of the more common shorter words that would likely
> appear in a valid e-mail: "a, and, or, if, you, we, I, to, for, the, this,
> that....."
>
> We should be able to come up with a routine which would analyze a given
> text string and rank it according to its likelihood of being a 'meaningful'
> message. Then use that routine in an e-mail client to rank messages and
> only download from the server those which appear to be 'real'.
>
> Ideas?
For the past few months I've been using the e-mail
client in Netscape 7.1. It has a "Bayesian" spam filter
that adapts to the streams of spam and normal mail
that you receive. It works pretty well.
It keeps track of all the words in your incoming e-mail,
and notes how often each word appears
in spam vs normal mail. For example, the word "Euphoria"
might have appeared in 1 of my spam messages and
99 of my normal messages, so if it sees "Euphoria" in a
message, that would indicate a 99% probability that this
is a normal message. But it doesn't just look at one word.
I believe it looks at the 20 or so words in each message
with the most extreme probabilities. It uses a formula from
Bayesian statistics to combine the probability indicated
by each word into a single overall probability. For example,
if you had a word that indicated "90% likely to be spam"
and another that said "95% likely to be spam", the result of
combining those two words would be higher than either one alone
(about 99% if the words are treated as independent evidence).
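As a rough illustration of the idea (not Netscape's actual code, which I haven't seen), here is a minimal Python sketch. The function names, the 0.01/0.99 clamping, and the assumption that roughly equal amounts of spam and normal mail have been seen are all mine. The combining step is the simple "treat each word as independent evidence" formula commonly used in Bayesian spam filters.

```python
def word_spam_probability(spam_count, ham_count):
    """Fraction of the messages containing a word that were spam."""
    total = spam_count + ham_count
    if total == 0:
        return 0.5  # unseen word: no evidence either way
    p = spam_count / total
    # Clamp so a single word can never force a result of exactly 0 or 1.
    return min(0.99, max(0.01, p))


def most_extreme(probabilities, limit=20):
    """Keep the `limit` probabilities farthest from the neutral 0.5."""
    return sorted(probabilities, key=lambda p: abs(p - 0.5), reverse=True)[:limit]


def combine(probabilities):
    """Combine per-word spam probabilities, treating words as independent."""
    spam = 1.0
    ham = 1.0
    for p in probabilities:
        spam *= p
        ham *= 1.0 - p
    return spam / (spam + ham)


# "Euphoria": seen in 1 spam and 99 normal messages.
print(word_spam_probability(1, 99))        # 0.01 spam, i.e. 99% normal

# Two spammy words at 90% and 95%.
print(round(combine(most_extreme([0.90, 0.95])), 3))
```

With this formula, the two words at 90% and 95% combine to roughly 99.4%, so agreeing evidence pushes the result past either word on its own.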
It will move a message out of your inbox into a spam folder
if the probability of it being spam is quite high,
something like 99%. Obviously you want to keep false
positives (real mail tagged as spam) to an absolute minimum.
In practice, over a long period of time, if I get 1000
messages of which 900 are spam, it will probably move
about 800 of the spams and 1 or 2 of the non-spams into
my spam folder.
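To put those rough figures in terms of rates, here is a small Python sketch. The function name `filter_rates` is made up for illustration, and the inputs are just the estimates above, not measurements.

```python
def filter_rates(spam, ham, caught, misfiled):
    """Return (share of spam caught, share of real mail misfiled,
    share of the spam folder that really is spam)."""
    return caught / spam, misfiled / ham, caught / (caught + misfiled)


spam, ham, caught = 900, 100, 800   # 1000 messages in total
for misfiled in (1, 2):
    catch, false_pos, purity = filter_rates(spam, ham, caught, misfiled)
    print(f"misfiled={misfiled}: caught {catch:.1%} of spam, "
          f"misfiled {false_pos:.1%} of real mail, "
          f"spam folder is {purity:.2%} spam")
```

So on those numbers the filter catches a bit under 90% of the spam while misfiling 1% to 2% of the real mail.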
With each batch of incoming mail, I check the spam folder
for non-spams, but usually I can quickly see from the
subjects and senders that there aren't any non-spams, so I
click a button to quickly delete all the spams in one
operation.
Whenever it tags a message incorrectly (usually spam
that it missed), you can click a button to tell it so.
This way it gradually learns and gets smarter.
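The learning step could look something like the following Python sketch. This is only my guess at the general shape, not how Netscape stores its data. The names `learn` and `correct` and the whitespace tokenizer are assumptions for illustration.

```python
from collections import Counter

spam_counts = Counter()   # word -> number of spam messages containing it
ham_counts = Counter()    # word -> number of normal messages containing it


def words_in(text):
    """Very rough tokenizer: lowercase, split on whitespace, drop duplicates."""
    return set(text.lower().split())


def learn(text, is_spam):
    """Record a message under the given label."""
    (spam_counts if is_spam else ham_counts).update(words_in(text))


def correct(text, is_spam):
    """Fix a mistake: take the message out of the wrong label's counts
    and record it under the right one."""
    wrong = ham_counts if is_spam else spam_counts
    wrong.subtract(words_in(text))
    learn(text, is_spam)


# The filter missed this one and filed it as normal mail...
learn("cheap pills from lagos", is_spam=False)
# ...so the user clicks the button to say it was spam.
correct("cheap pills from lagos", is_spam=True)
print(spam_counts["lagos"], ham_counts["lagos"])
```

Each correction shifts the per-word counts, which is what changes the word probabilities the next time a similar message arrives.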
Also, e-mail from anyone in my address book is automatically
considered non-spam, so the false positives are quite low.
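Putting the address book rule together with the high cutoff mentioned earlier, the final decision might be sketched like this in Python. The function name, the 0.99 default, and the case-insensitive address match are assumptions on my part.

```python
def should_move_to_spam(sender, spam_probability, address_book, threshold=0.99):
    """Decide whether a message goes to the spam folder."""
    known = {address.lower() for address in address_book}
    if sender.lower() in known:
        return False  # anyone in the address book is never treated as spam
    # Only move mail when the filter is very sure, to keep false positives low.
    return spam_probability >= threshold


book = {"friend@example.com"}
print(should_move_to_spam("Friend@example.com", 0.999, book))   # known sender
print(should_move_to_spam("stranger@example.com", 0.995, book))
print(should_move_to_spam("stranger@example.com", 0.90, book))
```

Checking the address book first means a known sender's mail stays in the inbox no matter how spammy its wording looks to the filter.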
Being able to delete a whole bunch of spams in one
operation saves time. It's also nice that it keeps
my inbox largely clear of distracting spam clutter.
Regards,
Rob Craig
Rapid Deployment Software
http://www.RapidEuphoria.com