Hapax Legomena and Spam

It used to be that hapax legomena were mostly something of interest to linguists and other word-freaks. After all, what use is a word that occurs only once in a whole corpus (and thus is often of uncertain meaning)? Well, if you’re a spamming low-life, then with what passes for ingenuity among your kind you might think that if people start filtering out words like “viagra” and “cock” in the subject header, you could substitute “v1agra” or “cokc” and get your evil missives through. How can a list of disallowed words possibly guard against hapax legomena?

It turns out, though, that there are better ways to filter than matching against obviously “spammy” words like viagra and cialis. As Paul Graham wrote in A Plan for Spam:

I think it’s possible to stop spam, and that content-based filters are the way to do it. The Achilles heel of the spammers is their message. They can circumvent any other barrier you set up. They have so far, at least. But they have to deliver their message, whatever it is. If we can write software that recognizes their messages, there is no way they can get around that.

Graham’s plan uses a pseudo-Bayesian probabilistic approach: the user trains the filter by classifying each message in a corpus of received mail as spam or not-spam. Tim Peters, who worked on SpamBayes, the Python implementation of this approach, took to calling the not-spam “ham”, and the name seems to have stuck in the anti-spam community.

The filter breaks every message apart into words (defined in this case as any run of text delimited by whitespace or punctuation) and then ranks each word by its spamminess or hamminess: the extent to which the mere presence of the word in a message is a good predictor of whether the message is spam or ham. A weighted aggregate score is computed over all the words in the message, and the filter classifies it as spam, ham, or not-sure (roughly equal ham and spam scores).

Spammers need to communicate, and in particular to get you to visit a web page or click on a link so they can sell you stuff. As a result, for any given person certain words are found in almost all spam messages but in almost no real messages (e.g. “cheap” and “click”, or words with numbers in them). Words that are commonly found in both types of messages, such as your name, or articles and prepositions, end up with a middling score that basically doesn’t change the final result.

The Graham approach has proved to be remarkably accurate, often getting no false positives or false negatives after only a week or two of training. I don’t think anyone has reported that they never get any unsures, no matter how much training is done, but that’s to be expected. Examined statistically over the whole body of the message, most spam leaves lots of clues to its spammy nature.
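For the curious, here is a minimal sketch of that train-then-score cycle in Python. It is a toy, not the SpamBayes code: the class name TinyFilter, the tokenizing regex, the clamping of word scores to the range 0.01 to 0.99, and the 0.2 and 0.9 cutoffs are all assumptions made for illustration, and the way the word scores are combined is only loosely modeled on the formula in Graham's essay.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return runs of letters, digits, or apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())


class TinyFilter:
    """Toy word-frequency spam filter (hypothetical, for illustration only)."""

    def __init__(self):
        self.spam_counts = Counter()  # word -> number of spam messages containing it
        self.ham_counts = Counter()   # word -> number of ham messages containing it
        self.n_spam = 0
        self.n_ham = 0

    def train(self, text, is_spam):
        words = set(tokenize(text))  # count each word once per message
        if is_spam:
            self.spam_counts.update(words)
            self.n_spam += 1
        else:
            self.ham_counts.update(words)
            self.n_ham += 1

    def word_prob(self, word):
        """Estimate how strongly this word alone suggests spam (0 to 1)."""
        spam_seen = self.spam_counts[word]
        ham_seen = self.ham_counts[word]
        if spam_seen + ham_seen == 0:
            return 0.5  # never seen in training: neutral
        spam_freq = spam_seen / max(self.n_spam, 1)
        ham_freq = ham_seen / max(self.n_ham, 1)
        p = spam_freq / (spam_freq + ham_freq)
        return min(0.99, max(0.01, p))  # assumed clamp: no word is absolute proof

    def score(self, text):
        """Combine the per-word scores into one spam score for the message."""
        # Plain products are fine for short messages; a long message would
        # need log-space arithmetic to avoid floating-point underflow.
        spam_product = ham_product = 1.0
        for word in set(tokenize(text)):
            p = self.word_prob(word)
            spam_product *= p
            ham_product *= 1.0 - p
        return spam_product / (spam_product + ham_product)

    def classify(self, text, ham_cutoff=0.2, spam_cutoff=0.9):
        """Return 'spam', 'ham', or 'unsure'; the cutoffs are assumed values."""
        s = self.score(text)
        if s >= spam_cutoff:
            return "spam"
        if s <= ham_cutoff:
            return "ham"
        return "unsure"


# A four-message training corpus, invented for this example.
demo = TinyFilter()
demo.train("Click here for cheap pills", is_spam=True)
demo.train("cheap viagra click now", is_spam=True)
demo.train("Lunch on Tuesday with Ignatz", is_spam=False)
demo.train("Meeting notes attached for Tuesday", is_spam=False)
print(demo.classify("click for cheap deals"))
print(demo.classify("Tuesday lunch with Ignatz"))
```

With so little training data every word seen in only one kind of mail is pushed to the clamp, so the toy is far more decisive than a real filter would be. Note that “for”, which appears equally often in both piles, scores exactly 0.5 and so leaves the result alone.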

So where do hapax legomena come in with Bayesian spam filtering? They don’t. One of the truly nifty things about the scheme is that, by definition, any word that has never been seen before counts as 0.5: it is neither spammy nor hammy, and it has no effect on the ultimate rating of the message. So the spammer’s trick of making a visually similar hapax legomenon is foiled. So is the other trick of padding the message with words unrelated to the spam to try to lower the score: unless the words happen to be hammy for your particular corpus, they won’t budge the score at all. But as soon as the message is classified one way or the other, whether by the other clues in the message or by the user if the message rates an unsure, training on it turns that hapax legomenon into a clue for the type of message it is. So “v1agra” becomes at least a mild clue for spam, while once your uncle Ignatz writes to you, his name becomes at least a mild clue that the message is ham.
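A small numerical sketch makes both halves of that argument concrete. Everything named here is an assumption for illustration: the combining rule is a simple product formula in the spirit of Graham's essay, the per-word scores in clues are invented, and the smoothing in word_prob (pulling a rarely seen word toward 0.5, with an assumed strength of 0.45) is just one way to get the “mild clue” behaviour, not a claim about what any particular filter does.

```python
import math

UNKNOWN_PROB = 0.5  # score of a never-seen word, as described above
STRENGTH = 0.45     # assumed weight of that neutral prior; illustrative only


def word_prob(spam_seen, ham_seen, n_spam, n_ham):
    """Score one word from how many spam and ham messages contained it."""
    n = spam_seen + ham_seen
    if n == 0:
        return UNKNOWN_PROB
    spam_freq = spam_seen / max(n_spam, 1)
    ham_freq = ham_seen / max(n_ham, 1)
    raw = spam_freq / (spam_freq + ham_freq)
    # Blend the raw estimate with the neutral prior; with few sightings
    # the prior keeps the score from jumping straight to 0 or 1.
    return (STRENGTH * UNKNOWN_PROB + n * raw) / (STRENGTH + n)


def combine(probs):
    """Merge per-word scores into one message score between 0 and 1."""
    spam_product = ham_product = 1.0
    for p in probs:
        spam_product *= p
        ham_product *= 1.0 - p
    return spam_product / (spam_product + ham_product)


clues = [0.9, 0.8, 0.3]  # invented scores for the known words in a message

# 1. A never-seen word, or twenty of them as padding, changes nothing.
plain = combine(clues)
with_hapax = combine(clues + [word_prob(0, 0, 100, 100)])
padded = combine(clues + [UNKNOWN_PROB] * 20)
print(math.isclose(plain, with_hapax), math.isclose(plain, padded))

# 2. After one training sighting, the word has become a clue.
v1agra = word_prob(1, 0, 100, 100)  # seen once, in a spam message
ignatz = word_prob(0, 1, 100, 100)  # seen once, in a ham message
print(v1agra, ignatz)
```

The neutrality in the first part is exact rather than approximate: a score of 0.5 multiplies the spam side and the ham side by the same factor, which cancels in the ratio. In the second part, under these assumed settings, a single sighting moves “v1agra” to about 0.84 and Ignatz to about 0.16, and further sightings would push each closer to its extreme.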

The SpamBayes Project has a nice discussion of how this all works.
