Spam Filtering using Google/GMAIL

keywords: spam filter google gmail pine elm

Email has become more and more polluted with spam and spam-like messages that fill up our inboxes every day. There are many different solutions for filtering out these unwanted messages, but few let you efficiently tag close to 100% of the junk without incurring large numbers of false positives or forcing you to "train" your filter on huge mail samples. Most filters also don't adapt well to new types of spam unless the user manually flags them first.

The real problem, of course, is that being a physicist, I've been using email almost since the dawn of its creation, and I have a HUGE archive of mail dating back to the days of Bitnet and VMS machines that I don't want to abandon. This means that switching over to a commercial web-based service is not viable, nor is switching platforms away from the central physics mail servers which have been processing my mail for over a decade. Add to that the fact that I *LIKE* using PINE as my mail reader due to its low overhead, terminal-based nature, and fast command-line sorting and filing... and you have a problem.

Now before going further: there are filtering solutions that work natively on your mail server and can over time become fairly accurate (I'm talking about SpamAssassin with Bayes filtering and really, really huge training samples), but none so far has compared to using Google's GMAIL as a pass-through filter.

Overview

Google released a free email service called GMAIL back in 2004. What was unusual about the service was that it indexed and stored all your email in a manner similar to the way that Google's search engine tools index web pages. This meant that Google could effectively "read" your email and then target ads at you, but it also meant that Google/GMAIL could build an extremely efficient spam detection engine.

Normally when you use Bayesian filtering, your filter is only as good as the probability distribution functions that you generate for your different outcomes. Low statistics mean larger statistical errors and more misclassification.

The same is true when this is applied to email. If a normal person receives only a few "good" emails and a few "spam" emails each day, then each sample set grows very slowly. Since the relative error goes as one over the square root of the sample size, in the limit of large N you do win, but not as quickly as you would like.

For a sample set of 100 entries you end up with about a 10% error, with 1,000 entries about a 3% error, and with 10,000 entries still a 1% error. This is good, but not good enough!
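
The figures above can be reproduced with a few lines of Python. This is only a sketch of the arithmetic, assuming the relative error is taken to be 1/sqrt(N); the function name relative_error is made up for this illustration.

```python
import math


def relative_error(n_samples: int) -> float:
    """Approximate relative statistical error for N samples, taken as 1/sqrt(N)."""
    if n_samples <= 0:
        raise ValueError("sample size must be positive")
    return 1.0 / math.sqrt(n_samples)


if __name__ == "__main__":
    # The three sample sizes quoted in the text.
    for n in (100, 1_000, 10_000):
        print(f"N = {n:>6}: about {relative_error(n):.1%} error")
```

Going from 100 to 10,000 samples is a hundredfold increase in mail but only a tenfold reduction in error, which is why a single mailbox improves so slowly.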

Google, unlike a normal person, is processing MILLIONS of emails daily. This means that its Bayes filters end up being extremely accurate in a very short time. Moreover, Google can do something that you and I can't: it can store any given incoming message and compare it to any prior message to flag duplicates. While this may not seem impressive at first, consider that spammers operate by sending the SAME message to millions of people. Hence if you can compare your mail to what other people are also receiving, you have a very good chance of defeating bulk mailers.
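
To make the duplicate idea concrete, here is a toy sketch in Python. It is NOT Google's actual method, which is not public; it only assumes that identical message bodies arriving for many different recipients are suspicious. The names fingerprint and BulkDetector, and the threshold of 3, are invented for this example.

```python
import hashlib
from collections import defaultdict


def fingerprint(body: str) -> str:
    """Hash a message body after collapsing case and whitespace."""
    normalized = " ".join(body.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


class BulkDetector:
    """Flags a body once it has been seen for many distinct recipients."""

    def __init__(self, threshold: int = 3):
        self.threshold = threshold  # hypothetical cutoff, for illustration only
        self.recipients_by_print = defaultdict(set)

    def looks_like_bulk(self, recipient: str, body: str) -> bool:
        seen = self.recipients_by_print[fingerprint(body)]
        seen.add(recipient)
        return len(seen) >= self.threshold


if __name__ == "__main__":
    detector = BulkDetector(threshold=3)
    spam = "Buy CHEAP watches now!!!"
    for i in range(4):
        who = f"user{i}@example.com"
        print(who, detector.looks_like_bulk(who, spam))
```

A single user never sees enough copies to cross the threshold; only a service that handles mail for many people can make this comparison.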

The end result is that Google/GMAIL has the infrastructure to provide a very good filtering system, and more importantly WE as users can tap into this to enrich our lives!

Setting up Google's GMAIL to filter spam for your local client

Google’s free email service GMAIL can be used as a very effective spam filtering system, and can be set up to operate in a passive manner where all the mail is still addressed to your normal email address, resides on your preferred server, and can be browsed, filed, and saved by your favorite email program (e.g. PINE, Elm, etc.). To do this there are a few requirements that must be met.

Requirements to use GMAIL in Filter Mode:

  1. A GMAIL Account
  2. An email account on a Unix-like host (e.g. a departmental server running Unix or Linux)
  3. The "sendmail" server that runs on the host computer must have "procmail" enabled for local delivery (most modern ones do)

The first item is self-explanatory. If you don't have a GMAIL account, then ask around and get one, either by having a friend invite you or by having Google text message an access code to your mobile phone.

The last two items are also straightforward, but knowing whether your sendmail installation supports procmail might not be. Most modern installations enable procmail at the server level. Type "man procmail", and if you get the man page, then you probably have nothing to worry about. If you don't, go talk to your system administrator and ask.
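
If you would rather check from a script, the following Python sketch looks for the procmail program on your search path. It assumes a Python interpreter is available on the mail host, and the function name find_procmail is made up here. Note that finding the program only shows that procmail is installed; it does not prove that sendmail hands your mail to it.

```python
import shutil


def find_procmail():
    """Return the full path of the procmail program on PATH, or None if absent."""
    return shutil.which("procmail")


if __name__ == "__main__":
    path = find_procmail()
    if path:
        print(f"procmail found at {path}")
    else:
        print("procmail not found on PATH; ask your system administrator")
```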

Once you have satisfied yourself that you meet the requirements, do the following:

Configure GMAIL as your spam filtering engine

  1. Establish a GMAIL account. This will give you an email address of the form user@gmail.com
  2. Log into GMAIL and, under the settings, mark that you want all your mail forwarded to the address where you currently get your mail (e.g. user@phys.virginia.edu)
  3. Log into the machine where you read your mail and create a “.procmailrc” file with the following mail handling recipes:
#####################################################################
# .procmailrc file for Account "johndoe@phys.virginia.edu"
#
#####################################################################
#
# Setup the general variables that are needed for procmail
#
MAILDIR=$HOME/mail
DEFAULT=/var/spool/mail/johndoe
GMAIL=johndoe@gmail.com
#####################################################################
# If a mail message has an X-Forwarded-For
# line in its header coming from gmail (i.e. gmail
# forwarded the mail to you) then file it in
# the standard inbox. The trailing colon in ":0:"
# asks procmail to use a lockfile while it writes
# to the mailbox.
:0:
* ^X-Forwarded-For: johndoe\@gmail\.com
$DEFAULT
 
##############################################
# If we get this far then the message doesn't
# have a X-Forwarded-For line, meaning it
# hasn't been through gmail yet....so we forward
# the mail to gmail for spam filtering!!!!
:0
! $GMAIL

This will set up a single delivery forwarding loop between your mail machine and GMAIL. The order of these recipes is extremely important. The delivery recipe that files mail into the $DEFAULT folder MUST come first; otherwise the mail will just go back to GMAIL over and over again on each pass through.

Let me explain....

With this recipe, all mail to you comes in and is checked to see whether it has taken a trip to GMAIL for filtering (this is the pattern match for an X-Forwarded-For line matching your GMAIL account). If it hasn’t yet been to GMAIL, then the first recipe doesn't match and the flow falls through to the next recipe, which forwards the mail off to GMAIL for checking. If it has been there already, then we can assume that it went through the spam filters there, and we file it in our normal mail folder.

Since GMAIL does spam checking BEFORE mail forwarding, all mail that is sent to GMAIL, regardless of where it is eventually destined to go, ends up getting spam checked. The local machine only keeps copies of the mails that GMAIL has cleared, and stores them locally as it always has, which allows you to continue using your favorite mail reader as you have been, but without the garbage. This approach eliminates almost ALL spam on the local account and has the added benefit of creating a backup copy at GMAIL of all the email you receive.
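
The round trip can be traced with a small Python simulation. This is only a model of the logic described above, not of procmail or GMAIL themselves. The function names are invented, the is_spam flag stands in for whatever GMAIL's filters decide, and the exact contents of the X-Forwarded-For header are an assumption (the recipe only relies on it starting with your GMAIL address).

```python
GMAIL = "johndoe@gmail.com"  # same placeholder address as in the recipe


def local_procmail(headers: dict) -> str:
    """Mimic the two recipes: file mail that has been to GMAIL, forward the rest."""
    if headers.get("X-Forwarded-For", "").startswith(GMAIL):
        return "deliver"  # first recipe: file in $DEFAULT
    return "forward"      # second recipe: ! $GMAIL


def gmail_forward(headers: dict, is_spam: bool):
    """Stand-in for GMAIL: spam check first, then forward with the extra header."""
    if is_spam:
        return None  # stays in GMAIL's spam folder and is never forwarded
    return {**headers, "X-Forwarded-For": f"{GMAIL} johndoe@phys.virginia.edu"}


def trace(headers: dict, is_spam: bool) -> list:
    """Follow one message through the loop and record each step."""
    steps = []
    while True:
        action = local_procmail(headers)
        steps.append(action)
        if action == "deliver":
            return steps
        headers = gmail_forward(headers, is_spam)
        if headers is None:
            steps.append("held as spam at GMAIL")
            return steps


if __name__ == "__main__":
    print(trace({"From": "friend@example.com"}, is_spam=False))
    print(trace({"From": "bulk@example.net"}, is_spam=True))
```

A good message makes exactly one trip: forwarded on the first pass, delivered on the second. A spam message is forwarded once and never comes back.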

Notes

  1. When setting up spam filtering, check to make sure that normal mail is still being delivered. Even small typos can mess up the mail system and cause your mail to be rejected.
  2. The default "maximum hops" limit that sendmail uses to catch looping mail is 25. Most modern servers have increased this number, but if you find that your mail is getting to GMAIL but not back (with an error about maximum hops), this could be the problem. Increasing the limit (the MaxHopCount option) in the sendmail.cf file will fix this.
  3. Make sure to log in to GMAIL every so often and clean things out. Your spam folder there will start building up rapidly, so don't be shy about killing those old spams (Google keeps all the statistics, so you don't have to worry about retaining large spam samples).
  4. You will find that some spam still gets through. Go to GMAIL periodically and use the "report spam" functionality to help improve GMAIL's filters.
  5. To undo the forwarding loop just move the .procmailrc file to a different name.
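
For note 2, it can help to see how many hops a delivered message actually took. Each mail server that handles a message normally adds a Received: header, so counting those headers gives a rough hop count. The sketch below uses Python's standard email module; the function name hop_count and the sample message are made up for illustration.

```python
from email import message_from_string

# A made-up message with two Received: headers.
SAMPLE = (
    "Received: from a.example.org by b.example.org; Fri, 25 Aug 2006 15:00:00 -0400\n"
    "Received: from c.example.org by a.example.org; Fri, 25 Aug 2006 14:59:58 -0400\n"
    "Subject: test\n"
    "\n"
    "body\n"
)


def hop_count(raw_message: str) -> int:
    """Count the Received: headers in a raw message."""
    msg = message_from_string(raw_message)
    return len(msg.get_all("Received", []))


if __name__ == "__main__":
    print(hop_count(SAMPLE))
```

Run it on a message that made the round trip through GMAIL to see how close you are to your server's limit.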


Andrew J. Norman
Last modified: Fri Aug 25 15:04:22 EDT 2006