Bogofilter
Encyclopedia
Bogofilter is a mail filter that classifies e-mail
E-mail
Electronic mail, commonly known as email or e-mail, is a method of exchanging digital messages from an author to one or more recipients. Modern email operates across the Internet or other computer networks. Some early email systems required that the author and the recipient both be online at the...

 as spam or ham (non-spam) by a statistical
Statistics
Statistics is the study of the collection, organization, analysis, and interpretation of data. It deals with all aspects of this, including the planning of data collection in terms of the design of surveys and experiments....

 analysis of the message's header and content (body). The program is able to learn from the user's classifications and corrections. It was originally written by Eric S. Raymond
Eric S. Raymond
Eric Steven Raymond , often referred to as ESR, is an American computer programmer, author and open source software advocate. After the 1997 publication of The Cathedral and the Bazaar, Raymond was for a number of years frequently quoted as an unofficial spokesman for the open source movement...

, and is now maintained together with a group of contributors by David Relson, Matthias Andree and Greg Louis.

The statistical technique used is known as Bayesian filtering
Bayesian spam filtering
Bayesian spam filtering is a statistical technique of e-mail filtering. It makes use of a naive Bayes classifier to identify spam e-mail.Bayesian classifiers work by correlating the use of tokens , with spam and non spam e-mails and then using Bayesian inference to calculate a probability that an...

 and its use for spam was first described by researchers at Microsoft
Microsoft
Microsoft Corporation is an American public multinational corporation headquartered in Redmond, Washington, USA that develops, manufactures, licenses, and supports a wide range of products and services predominantly related to computing through its various product divisions...

 in the paper A Bayesian Approach to Filtering Junk E-mail. Gary Robinson
Gary Robinson
Gary Robinson is an American software engineer notable for his mathematical algorithms to fight spam.-Fighting spam with algorithms:In 2003, Robinson published an article in Linux Journal which discussed mathematical approaches for fighting spam which led to work along with Tim Peters on the...

, in his weblog Rants, suggests some refinements for improved discrimination between spam and ham. Bogofilter's primary algorithm uses the f(w) parameter and the Fisher inverse chi-square technique that he describes.

Bogofilter is run by an MDA
Mail delivery agent
A mail delivery agent or message delivery agent is a computer software component that is responsible for the delivery of e-mail messages to a local recipient's mailbox...

 script to classify an incoming message as spam or ham (using wordlists stored by BerkeleyDB, SQLite3
SQLite
SQLite is an ACID-compliant embedded relational database management system contained in a relatively small C programming library. The source code for SQLite is in the public domain and implements most of the SQL standard...

 or QDBM). Bogofilter provides processing for plain text and HTML
HTML
HyperText Markup Language is the predominant markup language for web pages. HTML elements are the basic building-blocks of webpages....

. It supports multi-part MIME
MIME
Multipurpose Internet Mail Extensions is an Internet standard that extends the format of email to support:* Text in character sets other than ASCII* Non-text attachments* Message bodies with multiple parts...

 message with decoding of base64, quoted-printable
Quoted-printable
Quoted-printable, or QP encoding, is an encoding using printable ASCII characters to transmit 8-bit data over a 7-bit data path or, generally, over a medium which is not 8-bit clean...

, and uuencoded text and ignores attachments, such as images.

Standard tests at TREC 2005 show that Bogofilter compares well to its competitors spambayes
SpamBayes
SpamBayes is a Bayesian spam filter written in Python which uses techniques laid out by Paul Graham in his essay "A Plan for Spam". It has subsequently been improved by Gary Robinson and Tim Peters, among others....

, CRM114 and DSPAM
DSPAM
DSPAM is a free software statistical spam filter written by Jonathan A. Zdziarski, author of the book Ending Spam and other books. It is intended to be a scalable, content-based spam filter for large multi-user systems...

. Other competitors include, but are not limited to Spamprobe  and QSF.

Bogofilter is written in C
C (programming language)
C is a general-purpose computer programming language developed between 1969 and 1973 by Dennis Ritchie at the Bell Telephone Laboratories for use with the Unix operating system....

, and runs on Linux
Linux
Linux is a Unix-like computer operating system assembled under the model of free and open source software development and distribution. The defining component of any Linux system is the Linux kernel, an operating system kernel first released October 5, 1991 by Linus Torvalds...

, FreeBSD
FreeBSD
FreeBSD is a free Unix-like operating system descended from AT&T UNIX via BSD UNIX. Although for legal reasons FreeBSD cannot be called “UNIX”, as the direct descendant of BSD UNIX , FreeBSD’s internals and system APIs are UNIX-compliant...

, NetBSD
NetBSD
NetBSD is a freely available open source version of the Berkeley Software Distribution Unix operating system. It was the second open source BSD descendant to be formally released, after 386BSD, and continues to be actively developed. The NetBSD project is primarily focused on high quality design,...

, OpenBSD
OpenBSD
OpenBSD is a Unix-like computer operating system descended from Berkeley Software Distribution , a Unix derivative developed at the University of California, Berkeley. It was forked from NetBSD by project leader Theo de Raadt in late 1995...

, Solaris, Mac OS X
Mac OS X
Mac OS X is a series of Unix-based operating systems and graphical user interfaces developed, marketed, and sold by Apple Inc. Since 2002, has been included with all new Macintosh computer systems...

, HP-UX
HP-UX
HP-UX is Hewlett-Packard's proprietary implementation of the Unix operating system, based on UNIX System V and first released in 1984...

, AIX
AIX operating system
AIX AIX AIX (Advanced Interactive eXecutive, pronounced "a i ex" is a series of proprietary Unix operating systems developed and sold by IBM for several of its computer platforms...

 and other platforms.

See also

  • Blacklist
    Blacklist (computing)
    In computing, a blacklist or block list is a basic access control mechanism that allows everyone access, except for the members of the black list . The opposite is a whitelist, which means allow nobody, except members of the white list...

  • Greylisting
    Greylisting
    Greylisting is a method of defending e-mail users against spam. A mail transfer agent using greylisting will "temporarily reject" any email from a sender it does not recognize. If the mail is legitimate the originating server will, after a delay, try again and, if sufficient time has elapsed, the...

  • Whitelist
  • Tarpit

External links

  • Official homepage
  • A Plan for Spam – An essay by Paul Graham
    Paul Graham
    Paul Graham is a programmer, venture capitalist, and essayist. He is known for his work on Lisp, for co-founding Viaweb , and for co-founding the Y Combinator seed capital firm...

    discussing the main ideas behind this program


This article, or an earlier revision of it, was edited from bogofilter's homepage.
The source of this article is wikipedia, the free encyclopedia.  The text of this article is licensed under the GFDL.
 
x
OK