Filtering spam with bmf, procmail and mutt

©2003-2005 by Michael Knudsen.
$Date: 2005/01/01 20:09:54 $

This or excerpts hereof may under no circumstances be distributed in any form. This might change once the author is satisfied with content, layout, structure etc.

The eternal annoyance of spam

Spam, or UCE (unsolicited commercial email), has become an increasingly annoying issue for most people using email. If you use your real address on public forums such as Usenet, it is almost certain that, within days or weeks, it will have been harvested and sold, and in return people will send you exclusive offers on Viagra and penis enlargement. You may also have received complaints from people who threaten to report you to whatever agency handles spam abuse in your or their country, because you allegedly sent them an offer similar to the ones mentioned above.

Spammers are always one step ahead

If spam always looked the same, detecting it with computer software would be easy. However, that is generally not the case. Humans recognize spam easily, because they are capable of understanding the text in context. Computers can only recognize individual letters and words; they cannot make out the meaning of phrases and sentences. Thus, when detecting spam with computers we must resort to recognizing characteristics of the text itself instead of its meaning. One such characteristic is text written entirely in uppercase. However, spammers change or try to minimize these characteristics, so with time, detection this way becomes harder.

Spammers make the rules, and then we adapt to them. Once we have learned the rules, spammers change them.

Word-based filtering

There is one thing the spammers cannot change: what they write. There are only so many words which can be used to offer stock tips, transatlantic dental care and other marvelous business offers. Thus, one might suggest simply filtering out all messages containing one or more words from a list of blacklisted words. This solution does not work well, because a list of words that occur only in spam is really hard to make. You would not want to throw away the message with that raunchy joke about a penis enlargement going wrong. Such a list will either be ineffective (lots of false negatives because only a few words are blacklisted) or too effective (lots of false positives because one or more "regular" words are blacklisted).
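The weakness of a plain blacklist can be demonstrated with a few lines of shell. This is only a sketch: the function name is_blacklisted, the sample words and the sample messages are all made up for illustration.

```shell
#!/bin/sh
# is_blacklisted MESSAGE_FILE BLACKLIST_FILE
# Succeeds (exit status 0) if the message contains any blacklisted string.
# -F treats each blacklist line as a fixed string, -i ignores case,
# -q suppresses output so only the exit status is used.
is_blacklisted() {
    grep -q -i -F -f "$2" "$1"
}

# Hypothetical sample data in a scratch directory.
workdir=$(mktemp -d)
printf 'viagra\nenlargement\n' > "$workdir/blacklist"
printf 'Subject: joke\n\nHeard the one about the enlargement that went wrong?\n' > "$workdir/joke"
printf 'Subject: lunch\n\nSee you at noon.\n' > "$workdir/lunch"

# The harmless joke is rejected because it contains a blacklisted word.
is_blacklisted "$workdir/joke" "$workdir/blacklist" && echo "joke: rejected (false positive)"
is_blacklisted "$workdir/lunch" "$workdir/blacklist" || echo "lunch: accepted"

rm -r "$workdir"
```

A single blacklisted word is enough to reject a message, no matter how innocent the rest of it is.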

Instead of letting a single occurrence of a word determine whether a message is spam or not, one could consider how often the word is used in spam and in non-spam and, based on this, assign the word a probability that a message containing it is spam. For instance, if 9 out of 10 messages containing the word "offer" are spam, the word "offer" is assigned a spam probability of 0.9 (90%). This is a much more fine-grained way of detecting spam, and it is the approach used by bmf, the Bayesian Mail Filter.
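The arithmetic behind the "offer" example can be sketched as follows. The function name word_spam_probability is hypothetical, and returning 0.50 for a word that has never been seen is an assumption made for this sketch, not something taken from bmf.

```shell
#!/bin/sh
# word_spam_probability SPAM_WITH_WORD TOTAL_WITH_WORD
# Prints the fraction of the messages containing a word that were spam.
word_spam_probability() {
    awk -v spam="$1" -v total="$2" 'BEGIN {
        # Assumption: a word never seen before gets a neutral 0.50.
        if (total == 0) { print "0.50"; exit }
        printf "%.2f\n", spam / total
    }'
}

# 9 of the 10 messages containing "offer" were spam.
word_spam_probability 9 10    # prints 0.90
```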

The software

bmf

bmf uses Bayesian classification (FIXME: Link to resource on this.) to determine whether the input is spam or not. It does this by maintaining a database of words and their spamicity weights. bmf is capable of updating this database by itself, which means that it adapts to how spam looks. If spammers begin using new words, bmf learns to treat them as "spam words".
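How a set of per-word weights might be combined into a single verdict can be sketched as follows. The formula is the naive Bayes combination popularized by Paul Graham's essay "A Plan for Spam"; that bmf computes exactly this is an assumption, and the function name combine_spamicity is made up for illustration.

```shell
#!/bin/sh
# combine_spamicity P1 P2 ...
# Combines per-word spam probabilities into one message probability:
#   P = (p1 * p2 * ...) / (p1 * p2 * ... + (1 - p1) * (1 - p2) * ...)
combine_spamicity() {
    awk 'BEGIN {
        spam = 1; ham = 1
        for (i = 1; i < ARGC; i++) {
            spam *= ARGV[i]        # evidence that the message is spam
            ham  *= 1 - ARGV[i]    # evidence that it is not
        }
        printf "%.6f\n", spam / (spam + ham)
    }' "$@"
}

combine_spamicity 0.9 0.9    # prints 0.987805
combine_spamicity 0.9 0.2    # prints 0.692308
```

Two moderately suspicious words already push the result close to 1, while a single innocent word pulls it back down considerably.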

A nice feature of bmf is that it can work on either the standard input or on a named file, and it has a passthrough-mode in which the processed email is output on standard output with an extra header saying if the mail was determined to be spam. This makes bmf ideal to use with another mail processing package, procmail.

procmail

procmail uses a file with a sequence of rules (recipes is the official term) to determine what to do with the mail. A rule can e.g. add the mail to a mailbox in a variety of formats or feed it to a program which does something to it. procmail is a very powerful package which can do virtually anything to mail.

Software setup

bmf

For bmf to be able to distinguish between spam and non-spam, you must train it with both kinds of mail so it can assign weights to lots of common words. Train it before you let it analyze and tag your mail, so that it is as precise as possible from the start and avoids false positives.

The initial training of bmf is rather easy to do. You simply collect all your spam in one mailbox (e.g. $HOME/Maildir/bad/) until you have at least 50 spam messages. The more the better, but all messages should be from roughly the same period of time (do not run bmf on all spam received from two years ago until now). Do the same for your proper mail and save it in another mailbox (e.g. $HOME/Maildir/good/).
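To see whether enough mail has been collected, the messages in each mailbox can be counted. The following is a sketch assuming the Maildir mailboxes named above; the function name count_messages is hypothetical.

```shell
#!/bin/sh
# count_messages MAILDIR
# Prints the number of messages in the cur/ and new/ subdirectories of a
# Maildir mailbox. tmp/ is skipped because it holds incomplete deliveries.
# The body runs in a subshell so its variables do not leak to the caller.
count_messages() (
    total=0
    for sub in cur new; do
        [ -d "$1/$sub" ] || continue
        n=$(find "$1/$sub" -type f | wc -l)
        total=$((total + n))
    done
    echo "$total"
)

for box in bad good; do
    n=$(count_messages "$HOME/Maildir/$box")
    if [ "$n" -lt 50 ]; then
        echo "$box: only $n messages, keep collecting"
    else
        echo "$box: $n messages, ready for training"
    fi
done
```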

Training of bmf is done in the following way (assuming you are using Maildir mailboxes and all messages are marked as new):

$ cd ~/Maildir/bad/new
$ for i in * ; do bmf -s -i "$i" ; done
$ cd ~/Maildir/good/new
$ for i in * ; do bmf -n -i "$i" ; done

Alternatively, if the messages are not necessarily marked as new:

$ find ~/Maildir/bad -type f -exec bmf -s -i {} \;
$ find ~/Maildir/good -type f -exec bmf -n -i {} \;

procmail

Assuming bmf is properly trained, it is now time to make your procmail recipes to filter away spam. Two recipes are needed for this: One passes your mail through bmf, the other checks the output from bmf:

:0fw
| bmf -p

:0:
* ^X-Spam-Status: Yes
spam/

The first rule passes the message to bmf. If the mail is found to be spam, bmf inserts a header, "X-Spam-Status: Yes". The second rule is applied if such a header is found, and it causes the mail to be delivered to the "spam" mailbox in Maildir format (a trailing / means Maildir format).
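Before relying on the recipes, it can be useful to look at the header that the second recipe matches. The following sketch prints the X-Spam-Status header of a message read from standard input; the function name spam_status and the sample message are made up for illustration.

```shell
#!/bin/sh
# spam_status
# Prints the X-Spam-Status header of the message on standard input.
# Only the header section (everything before the first empty line) is
# examined, so a matching line in the message body is ignored.
spam_status() {
    awk '/^$/ { in_body = 1 } !in_body && /^X-Spam-Status:/ { print }'
}

# Hypothetical message that has already been tagged.
printf 'From: someone@example.org\nX-Spam-Status: Yes\n\nBuy now!\n' | spam_status
# prints: X-Spam-Status: Yes
```

With a real message stored in a file, the output of "bmf -p" can be piped into the function in the same way. Note that running "bmf -p" by hand like this also registers the message, just as it does when called from procmail.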

It is important to note that bmf updates its weights when running in passthrough mode ("bmf -p"). This means that if it wrongly deems a spam message to be non-spam, it still updates its weights according to that wrong verdict after processing the mail. Future mail is then scanned with skewed weights, which makes further errors more likely. In these cases, bmf must be notified of its error so it can readjust its weights and give more precise results in the future.

Correcting bmf weights

Command line

bmf allows you to fix wrongly registered messages. "bmf -S" registers the input as spam and undoes a prior registration as non-spam; "bmf -N" registers the input as non-spam and undoes a prior registration as spam. "bmf -t" tests the input and reports whether it thinks it is spam or not. Testing does not update the weights.

$ bmf -S -i file-containing-undetected-spam
$ bmf -N -i file-containing-proper-mail-taken-for-spam
$ bmf -t -i file
# Spamicity: 1.000000
# 'mk2709' -> 0.990000
# 'indiatimes.com' -> 0.990000
# 'mk25' -> 0.990000
# 'msn.com' -> 0.990000
# 'mk2568' -> 0.990000
# 'mk2547' -> 0.990000
# 'mk2541t' -> 0.990000
# 'mk250cal' -> 0.990000
# 'mk24y' -> 0.990000
# 'aol.com' -> 0.990000
# 'mk24tempe' -> 0.990000
# 'metcom.com' -> 0.990000
# 'mead.net' -> 0.990000
# 'mar-con.com' -> 0.990000
# 'mk24' -> 0.990000

Updating weights from mutt

Finding the right file and invoking bmf correctly on it can be a tiresome task. The task becomes even worse when all of your mail is stuffed together in a single, lousy mbox-file. Fortunately, this can be done from mutt via simple keystrokes. Add the following to your .muttrc:

macro index \ed "<enter-command>unset wait_key\n<pipe-entry>bmf -S\n<enter-command>set wait_key\n<save-message>=spam/\n" "Tags a given message as SPAM"
macro index \eu "<pipe-entry>bmf -N\n<enter-command>set wait_key\n<save-message>=incoming/\n" "Untags a given message as SPAM"
macro index \et "<pipe-entry>bmf -t\n<enter-command>set wait_key\n" "Tests a given message to see if it is SPAM"

These macros add keyboard shortcuts to the mail folder overview (the index). If you press ESC-d, you teach bmf that the mail was spam, and the mail is saved to the spam mailbox. If you press ESC-u, you teach bmf that the mail was, in fact, not spam, and the mail is saved to the incoming mailbox. Pressing ESC-t invokes the spam test on the highlighted mail.

If you also want the keys to work when reading a message, add the following lines to your .muttrc:

macro pager \ed "<enter-command>unset wait_key\n<pipe-entry>bmf -S\n<enter-command>set wait_key\n<save-message>=spam/\n" "Tags a given message as SPAM"
macro pager \eu "<pipe-entry>bmf -N\n<enter-command>set wait_key\n<save-message>=incoming/\n" "Untags a given message as SPAM"
macro pager \et "<pipe-entry>bmf -t\n<enter-command>set wait_key\n" "Tests a given message to see if it is SPAM"