Fighting Spam @ Clarinet

July 14, 2011

Feeling annoyed at getting 1 or 2 inappropriate messages in your e-mail box each day? Is real mail being squeezed out by spam? Spare a thought for the service providers who have already thrown away 50 percent of the e-mail sent because it is obviously spam.

Clarinet Internet Solutions is a small ISP located in Melbourne, Australia. Like other ISPs we have to deal with large quantities of e-mail, and a large proportion of that e-mail contains Unsolicited Commercial E-Mail (UCE) and malware. In this article we explain some of the strategies we employ to keep our customers sane and safe.

Introduction

Fighting spam, for the service provider, consists of mounting a defence against a series of ongoing attacks. Some of these attacks are new, but the vast majority are repeats of strategies that worked some time in the past. In the case of old strategies, we can respond quickly and effectively because we already have mechanisms in place; new strategies tend to succeed for a time because we have to engineer a solution.

Service providers have a number of weapons in their arsenal including:

Greylisting
Black lists
Forward and Reverse DNS checks
Mail filtering
Spam traps

In addition the final recipients of e-mail can use:

Personal Virus filters
Personal Mail filters
Learning mail filters

Before discussing the defences against spam it is useful to understand the enemy.

Spam, Scams, Phishing, and Malware

Over half of the e-mail arriving at Clarinet’s mail servers is unwanted by the final recipients. This e-mail is commonly known as spam. However, it is more useful to classify this e-mail into a number of sub-types based on the objectives of the sender:

Unsolicited Commercial E-mail (UCE)
Senders of this type of e-mail want the recipients to buy an item. These messages contain an offer for goods or services and some instructions for ordering them. This type of e-mail always contains at least one piece of information that identifies the sender, otherwise there is no way for them to collect the orders. UCE works because e-mail is cheap to send, so even very low response rates result in a profit. Strictly speaking the word spam applies only to this type of e-mail; however, its casual usage has broadened to cover any unwanted e-mail.

Scams
Although P.T. Barnum didn’t coin the expression “There’s a Sucker Born Every Minute”, the Internet is rife with schemes that assume this is true and are designed to separate people from their money. These schemes often promise riches in exchange for little effort if the recipient will just provide a little money or access to their bank accounts. Once again the low cost of sending out e-mails makes it cost effective to find the relatively few respondents.

Phishing
Phishing attempts to trick the recipient into providing personal information, typically usernames, passwords and bank account details, to the sender. These schemes are quite sophisticated in that they often exactly reproduce the look and feel of a genuine communication from a bank or service provider.

Malware
Viruses, bots and spyware are programs which the sender of the message wants the recipient to run:

Viruses just want to reproduce but may also carry a payload that damages the running computer or installs spyware or a bot
Bots respond to instructions and can be used to send e-mail or damage the services of other users
Spyware extracts information from the target computer. This information could include the keystrokes of passwords or the contents of files

The objectives of the sender shape the type of message sent and can help us to distinguish between desired and undesired e-mail. For instance:

Malware is almost always an executable program, thus we only need to check executable programs to see if they are a recognised virus, bot or spyware

Spam Defence at the ISP

Defensive Layers

ISPs practice defence in depth when fighting spam. No single technique is 100 percent effective and, as we acknowledge in the section on false positives and false negatives, tightening the rules necessarily misclassifies more good e-mail as spam.

Defence in depth consists of layering up various imperfect defences in the hope of gradually whittling down the number of attackers that get through at each layer. In the following sections we will explain the details of each layer in our defensive strategy.

Defences do not stop at the ISP: individual recipients can employ tools that filter their own mail. These tools can be more aggressive than those employed by the ISP because they have a better idea of what does and definitely does not interest a particular recipient. These tools are picked up in a later section.

False positives, False Negatives

The difficulty for the service provider is that senders of unwanted e-mails work hard to make them look like ordinary e-mail that the recipients want. The service provider sorts e-mails into groups:

We definitely think this is spam
Not sure
We definitely think this is a desired e-mail

Accidentally putting a desired e-mail into the spam pile is called a false positive, and accidentally putting a spam message into either of the other categories is a false negative.

Because only the final recipient of an e-mail can truly know whether or not an e-mail is desired there is always a risk of misclassification. For instance:

Almost no-one wants to receive a virus in an e-mail message; however, a computer security researcher may have asked a colleague to send him an example of a new virus for his virus collection.

For any method the rate of false positives is inversely related to the rate of false negatives. Thus tightening the rules decreases the amount of spam, but at the cost of increasing the number of messages that are misclassified as spam.
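The trade-off can be illustrated with a small sketch. The scores and labels below are invented for illustration and `error_counts` is a hypothetical helper, not part of any tool described here. It counts both kinds of error when every message scoring at or above a threshold is treated as spam.

```python
# Hypothetical (score, is_spam) pairs; the numbers are invented for illustration.
SCORED_MAIL = [
    (0.5, False), (1.2, False), (2.8, False), (4.1, False),
    (3.0, True), (4.9, True), (6.3, True), (8.7, True),
]


def error_counts(threshold, scored_mail=SCORED_MAIL):
    """Return (false_positives, false_negatives) for a spam threshold.

    Mail scoring at or above the threshold is treated as spam.
    """
    false_positives = sum(1 for score, is_spam in scored_mail
                          if score >= threshold and not is_spam)
    false_negatives = sum(1 for score, is_spam in scored_mail
                          if score < threshold and is_spam)
    return false_positives, false_negatives


for threshold in (2.0, 4.0, 6.0):
    fp, fn = error_counts(threshold)
    print(f"threshold {threshold}: {fp} false positives, {fn} false negatives")
```

With this made-up data, the strictest threshold (2.0) lets no spam through but discards two desired messages, while the most lenient (6.0) discards no desired mail but lets two spam messages through.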

Greylisting

We use a specially modified version of OpenBSD’s pfspamd to provide this service. Our patches to pfspamd:

allow it to work on FreeBSD with IPFW
enable stuttering to discourage bulk e-mailers by wasting their time
delay closing the connection after issuing the temporary failure message to prevent some mail servers from assuming the connection was lost and ignoring the temporary failure message
add connection ids for tracking sessions in logs

We have released a tar ball of our port and fed back our changes to the FreeBSD project.

Greylisting is the most recent addition to our arsenal of anti-spam tools. The concept of greylisting is based on the observation that normal mail servers attempt to re-send mail if they are politely told that the mail server is unable to accept it at this time; bulk e-mail tools, on the other hand, are designed to send as much e-mail as possible, so when they are told to come back later, they tend not to. The secret to making greylisting not delay all e-mail is to allow mail servers which have successfully retried sending an e-mail to always get through immediately.
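The idea can be sketched in a few lines. This is a toy model keyed only on the sending server's address, not the logic of our patched pfspamd; the `Greylist` class, its method names and the 300 second retry window are all assumptions made for illustration.

```python
import time


class Greylist:
    """Toy greylister keyed on the sending server's IP address."""

    def __init__(self, min_delay_seconds=300, clock=time.time):
        self.min_delay_seconds = min_delay_seconds  # assumed retry window
        self.clock = clock          # injectable so the logic can be tested
        self.first_seen = {}        # ip -> time of the first attempt
        self.passed = set()         # servers that have retried successfully

    def check(self, ip):
        """Return 'accept' or 'tempfail' for a connection from ip."""
        if ip in self.passed:
            return "accept"         # known retrier: no delay at all
        now = self.clock()
        first = self.first_seen.setdefault(ip, now)
        if now - first >= self.min_delay_seconds:
            self.passed.add(ip)     # it came back later, as a real server would
            return "accept"
        return "tempfail"           # politely ask the sender to try again
```

A first connection from a new address gets `tempfail`; a retry after the window gets `accept`, and every later connection from that address is accepted immediately.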

There are a number of problems with greylisting:

Some mail providers have pools of mail servers and messages will only get accepted when a message comes from the same server twice.
Some mail clients try to talk directly to their destinations and hence don’t retry regularly.
The first time a mail server is seen its e-mail is delayed.

These problems are principally addressed by identifying valid mail servers and initialising the greylister so that valid mail servers are not delayed.

Greylisting is useful for protecting mail servers from previously undiscovered bulk e-mailers. This mechanism prevents spam getting into the e-mail system. It has the additional advantage of restricting the bulk e-mailers to sending only a small message to the target mail server, which limits the cost incurred when a bulk e-mailer abuses the service provider’s mail server.

Black lists

We use several black lists including:

combined.njabl.org
xbl.spamhaus.org
bl.spamcop.net

As sources of spam are identified, they are reported to the operators of real time black lists. Service providers who subscribe to a black list check each incoming mail server connection against it. If the connecting server appears on the list, the message is not accepted and the reject message suggests that genuine senders should appeal to the black list to be removed. Spammers don’t get these messages as they tend to conceal their identity by pretending to be users of other systems.
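DNS-based black lists are conventionally queried by reversing the octets of the connecting IPv4 address, appending the black list's zone, and looking the resulting name up: an answer means the address is listed, and a "no such name" reply means it is not. A minimal sketch follows; the function names are hypothetical and this is not our production check.

```python
import ipaddress
import socket


def dnsbl_query_name(ip, zone):
    """Build the DNS name to look up for an IPv4 address in a black list zone."""
    octets = str(ipaddress.IPv4Address(ip)).split(".")
    return ".".join(reversed(octets)) + "." + zone


def is_listed(ip, zone, resolve=socket.gethostbyname):
    """Return True if the black list zone has an entry for ip.

    Any lookup failure is treated as "not listed", which also covers
    resolver errors; a real deployment would want to tell these apart.
    """
    try:
        resolve(dnsbl_query_name(ip, zone))
        return True
    except socket.gaierror:
        return False


print(dnsbl_query_name("192.0.2.99", "xbl.spamhaus.org"))
# 99.2.0.192.xbl.spamhaus.org
```

The `resolve` parameter exists only so the lookup can be replaced in tests. Most black list operators publish usage terms that should be checked before querying them in bulk.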

Some black lists also list misconfigured mail servers that can be exploited by spammers for sending e-mails indirectly. These misconfigured mail servers are known as open relays.

Black lists can be highly effective as they can gather information rapidly from many corners of the globe.

Black lists are imperfect too:

Black lists are necessarily behind the producers of spam: they can only list new sources of spam and misconfigured mail servers after they are detected
The quality of the black list depends on the diligence of the maintainers. Maintainers need to both add new sources quickly and remove incorrectly added sites quickly. A failure to do either of these results in spam getting through or valid e-mails being rejected

As with greylisting, black listing only requires a small amount of the message to be received before a decision is made, once again limiting the cost of a spammer’s abuse of the mail server.

Forward and Reverse DNS checks

Clarinet does not reject a mail just because it fails the forward and reverse DNS checks. We just add points to its SpamAssassin score (see mail filtering).

Humans work well with names whilst computers work well with numbers. The Domain Name System (DNS) translates names to numbers and vice-versa. Service providers can check:

that the site talking to it has a reverse DNS entry, i.e. has a name associated with its number
that the name gained from the reverse DNS entry when looked up matches the IP address of the site talking to it

Because forward and reverse DNS entries are often controlled by two different organisations (forward entries are controlled by the owners of the name and reverse entries by the owners of the address), a mismatch makes it highly likely that the site talking to the server is not intended to be an e-mail server and might be a bot or a bulk e-mailer.
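The two checks can be sketched with Python's standard resolver calls. The `dns_check` function and its three result strings are hypothetical names chosen for this example. It handles IPv4 only and returns a result for scoring rather than rejecting the connection.

```python
import socket


def dns_check(ip, reverse=socket.gethostbyaddr, forward=socket.gethostbyname_ex):
    """Return 'no-reverse', 'mismatch' or 'ok' for a connecting IPv4 address."""
    try:
        hostname = reverse(ip)[0]          # reverse lookup: number -> name
    except OSError:
        return "no-reverse"
    try:
        addresses = forward(hostname)[2]   # forward lookup: name -> numbers
    except OSError:
        return "mismatch"
    return "ok" if ip in addresses else "mismatch"
```

A filter could then add points for `no-reverse` and `mismatch` results instead of refusing the mail outright. The `reverse` and `forward` parameters are there so the resolver can be replaced when testing.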

Mail filtering

There are many aspects to mail filtering:

Pattern matching
Dangerous attachment type removal
Virus filtering
Optical character recognition
Image statistical characteristics

Pattern Matching

Many spam messages contain sets of words or characters that are common in spam but not common in ordinary messages. For instance a message containing a misspelling of “Viagra” and instructions on how to buy it on line is quite likely to be spam. By allocating points to each of these sets of words or characters, a score can be computed of each message. Messages with high scores can be discarded and messages with intermediate or low scores are passed on to the recipient.
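A minimal sketch of this kind of scoring is shown below. The rules, the point values and the discard threshold are all invented for illustration; they are not our rules nor those shipped with any particular filter.

```python
import re

# Hypothetical rules: (pattern, points). Values are invented for illustration.
RULES = [
    (re.compile(r"v[i1!][a@]gr[a@]", re.IGNORECASE), 3.0),
    (re.compile(r"\b(buy|order)\s+(now|online)\b", re.IGNORECASE), 1.5),
    (re.compile(r"\bclick here\b", re.IGNORECASE), 1.0),
]
DISCARD_THRESHOLD = 4.0  # assumed cut-off


def spam_score(text):
    """Add up the points of every rule whose pattern appears in the text."""
    return sum(points for pattern, points in RULES if pattern.search(text))


def should_discard(text):
    """Return True when the score reaches the discard threshold."""
    return spam_score(text) >= DISCARD_THRESHOLD


print(spam_score("Cheap V1agra, order online today"))  # 4.5
print(spam_score("Lunch on Friday?"))                  # 0
```

Each rule contributes its points at most once per message, so repeating a phrase does not inflate the score.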

In addition to matching the content of the message, the headers of the message are examined for features that indicate that part of the message path has been faked or that the message is from a non-existent source. Valid e-mail is very unlikely to have these features, hence they attract a high number of points.

Dangerous attachment type removal

Some attachments can be used as vectors for malware or phishing. These attachment types are typically used for scripts and programs. Furthermore, there are some attachments that have no meaning when sent through e-mail. An example of a useless attachment is:

.lnk files that refer to local files are not useful on machines other than the sending computer, unless there is already a file on the same place on the receiving machine with the same contents

Removing attachments because they might contain something bad protects customers from new malware for which signatures have yet to be generated, but also generates resentment from knowledgeable users when harmless items have been removed.
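As an illustration, the sketch below uses Python's standard `email` package to find attachments whose file extension is on a blocked list. The list of extensions and the `blocked_attachments` helper are assumptions for this example. It only reports the offending filenames, leaving the decision to strip them to the caller.

```python
import os
from email.message import EmailMessage

# Assumed list of extensions; a real deployment would maintain its own.
BLOCKED_EXTENSIONS = {".exe", ".scr", ".pif", ".vbs", ".lnk"}


def blocked_attachments(message):
    """Return filenames of attachments whose extension is on the blocked list."""
    blocked = []
    for part in message.walk():
        filename = part.get_filename()
        if filename and os.path.splitext(filename)[1].lower() in BLOCKED_EXTENSIONS:
            blocked.append(filename)
    return blocked


# Usage: build a message with one blocked and one harmless attachment.
message = EmailMessage()
message.set_content("Please see the attached invoice.")
message.add_attachment(b"fake program bytes", maintype="application",
                       subtype="octet-stream", filename="invoice.EXE")
message.add_attachment(b"fake image bytes", maintype="image",
                       subtype="png", filename="photo.png")
print(blocked_attachments(message))  # ['invoice.EXE']
```

Matching on the file name alone is easy to evade by renaming a file, which is one reason this check is only one layer among several.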

Virus filtering

We use ClamAV to filter for viruses and update both the engine and the signatures regularly.

Attachments are unpacked and scanned. The scanner checks for signatures in the files and if a signature matches then the e-mail is rejected.

Virus filtering has two flaws, both minor:

Like pattern matching, virus scanners are always playing catchup with the virus writers. It takes time for a new virus to be analysed and a signature created and distributed. During this window viruses can spread.
There is a very small chance that a virus signature will match a file that does not contain a virus. This would lead to the e-mail being rejected

Optical character recognition

We have developed our own OCR-based spam scanning and have integrated it with SpamAssassin.

The effectiveness of pattern matching has driven some spammers to avoid using words so they send pictures instead. By running these pictures through an Optical Character Recognition (OCR) program and applying pattern matching to the output it is possible to detect spam features in the pictures.

Image statistical characteristics

We have developed our own statistical measures for detecting images used in spam messages and have integrated them with SpamAssassin.

In the continuing war between spam senders and ISPs, the spammers have discovered that distorting their images and adding speckles and other “noise” reduces the effectiveness of OCR. However, some statistical properties are still more common in spam images than in other images. These include measures of the information density of the image (the number of bytes required to represent a pixel) and the observation that spam images tend not to have extreme aspect ratios (unlike section separators, spam is rarely a few pixels high and a screen wide). Although these measures are not definitive, they can be used in combination with other measures to improve detection.
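The two measures mentioned above are simple to compute once the image dimensions and file size are known. The sketch below is illustrative only: the function names and the aspect ratio cut-off are assumptions, and it does not reproduce the measures we actually use.

```python
MAX_ASPECT_RATIO = 10.0  # assumed cut-off for an "extreme" aspect ratio


def image_features(width, height, file_size_bytes):
    """Return (bytes_per_pixel, aspect_ratio) for an image attachment."""
    if width <= 0 or height <= 0:
        raise ValueError("image dimensions must be positive")
    bytes_per_pixel = file_size_bytes / (width * height)
    # Long side over short side, so the ratio is always at least 1.
    aspect_ratio = max(width, height) / min(width, height)
    return bytes_per_pixel, aspect_ratio


def has_extreme_aspect_ratio(width, height):
    """Return True for very long, thin images such as section separators."""
    _, aspect_ratio = image_features(width, height, 0)
    return aspect_ratio > MAX_ASPECT_RATIO


print(image_features(600, 2, 1200))        # (1.0, 300.0)
print(has_extreme_aspect_ratio(600, 2))    # True
print(has_extreme_aspect_ratio(400, 300))  # False
```

These features would be fed into the overall score alongside the other checks rather than used to reject a message on their own.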

Spam traps

A spam trap is an e-mail address which never gets sent real e-mail. Thus any message in the spam trap must be spam. These messages are matched against other e-mail messages sent to the system and any message that matches significantly is rejected.
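One way to decide whether a message "matches significantly" is to compare overlapping runs of words. The sketch below uses Jaccard similarity over three-word shingles. This technique, the helper names and the 0.5 threshold are assumptions chosen for illustration and are not necessarily how our spam trap matching works.

```python
MATCH_THRESHOLD = 0.5  # assumed similarity needed to count as a match


def shingles(text, size=3):
    """Return the set of overlapping word runs of the given size."""
    words = text.lower().split()
    if len(words) < size:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def similarity(a, b):
    """Jaccard similarity of the two texts' shingle sets (0.0 to 1.0)."""
    sa, sb = shingles(a), shingles(b)
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)


def matches_trap(message, trap_messages):
    """Return True if the message closely resembles any spam trap message."""
    return any(similarity(message, trap) >= MATCH_THRESHOLD
               for trap in trap_messages)


trap = ["win a free holiday now just send your bank details today"]
print(matches_trap("Dear Bob win a free holiday now just send your bank details today", trap))  # True
print(matches_trap("minutes of the committee meeting are attached", trap))  # False
```

Comparing word runs rather than whole messages means a personalised greeting added to an otherwise identical message does not defeat the match.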

Spam Defence at the Recipient

Because only the final recipient of an e-mail can be the true arbiter of whether the mail is desired or not, personal anti-spam tools can make better choices for the individual than tools at the ISP.

Personal Virus Filters

Every e-mail recipient should have an up-to-date virus scanner on their computer. The chances are high that the virus scanner will be different to the ones used at the ISP, providing defence in depth for their PC. Also, by scanning your own files you get a second chance to catch viruses which have slipped through the service provider’s virus scanners.

Personal Mail filters

There are many commercial products that filter e-mail that you can run on your own machine. Some of these are built into e-mail viewers.

Learning mail filters

There is a class of e-mail filters that can learn about an individual’s preferences. These tools are taught a user’s preferences by the user marking mail as spam or desired e-mail. Various mathematical models can be used on the collection of good and spam e-mail. When a new e-mail is presented, the model is applied, and the model marks the mail as either good or spam. Some models include:

Bayesian Analysis
Linguistic Analysis
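As an illustration of the Bayesian approach, the sketch below is a very small naive Bayes classifier trained on words from messages the user has marked. The class name, the whitespace tokenisation and the add-one smoothing are assumptions made for this example; real learning filters are considerably more refined.

```python
import math
from collections import Counter


class NaiveBayesFilter:
    """Toy word-based naive Bayes filter with add-one smoothing."""

    def __init__(self):
        self.word_counts = {"spam": Counter(), "good": Counter()}
        self.message_counts = {"spam": 0, "good": 0}

    def train(self, text, label):
        """Record a message the user has marked as 'spam' or 'good'."""
        self.word_counts[label].update(text.lower().split())
        self.message_counts[label] += 1

    def classify(self, text):
        """Return 'spam' or 'good', whichever label is more probable."""
        vocabulary = set(self.word_counts["spam"]) | set(self.word_counts["good"])
        total_messages = sum(self.message_counts.values())
        scores = {}
        for label in ("spam", "good"):
            # Log of the smoothed prior probability of this label.
            score = math.log((self.message_counts[label] + 1) / (total_messages + 2))
            total_words = sum(self.word_counts[label].values())
            for word in text.lower().split():
                count = self.word_counts[label][word]
                # The extra +1 in the denominator leaves room for unseen words.
                score += math.log((count + 1) / (total_words + len(vocabulary) + 1))
            scores[label] = score
        return max(scores, key=scores.get)


mail_filter = NaiveBayesFilter()
mail_filter.train("cheap pills buy now", "spam")
mail_filter.train("buy cheap watches now", "spam")
mail_filter.train("meeting agenda for monday", "good")
mail_filter.train("lunch on monday", "good")
print(mail_filter.classify("buy cheap pills"))          # spam
print(mail_filter.classify("agenda for monday lunch"))  # good
```

Because the filter is trained only on one person's mail, it can learn that a word which signals spam for most people is perfectly normal for that recipient.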