Archive for October, 2004

Spelling Reform

Language Log: More on Spelling Reform

I’ve been meaning to get around to this for a while, but on reading Bill Poser’s take on spelling reform, I think the objection that Via de Argilla raises, that historical spelling (or at any rate a not-completely-phonetic spelling) smooths over regional variation, is more potent than Poser allows.
Take, for instance, Language Log’s discussion of NEW-clee-er vs. NEW-cue-ler.
While a reformed English spelling might very well get away with eliminating kn, which nobody pronounces with the k anymore except to be funny [1], I think that if a completely phonetic spelling system were pushed through there would be genuine problems of the newcleeer/newcueler variety. One variant would be pushed out, and suddenly, for everyone who says the word the other way, it’s just another non-phonetic system.
I’m not sure that care in creating the writing system will really solve these problems.

  1. And that harks back to the objection Bill Poser mentioned earlier: that those who learn to read after the reform will have difficulty reading older writing unless it’s reprinted for them. Do we really want to create a world where nobody has any idea why the French knight taunting Arthur and his companions pronounces the word with both k and g sounds? I say thee nay.

Friday, October 29th, 2004

Logomacy

Apropos of hapax legomena, logomacy is one. Or at least it was until a handful of other bloggers used it as the proper name for this blog. As it stands, all Google hits on logomacy point to this site or to a site referring to this one.

As the subtitle of the blog hints, logomachy and logomancy are words you can find in dictionaries: logomachy is arguing or disputing about words, or a battle of words, while logomancy is divination by words. And logomacy, which sounds like it ought to mean something, is just a pun on my name that falls lexically between the two.

Friday, October 29th, 2004

Hapax Legomena and Spam

It used to be that hapax legomena were mostly of interest to linguists and other word-freaks. After all, what use is a word that doesn’t occur anywhere else (and thus is often of uncertain meaning)? Well, if you’re a spamming low-life, then with what passes for ingenuity among your kind you might think that if people start filtering out words like “viagra” and “cock” in the subject header, you could substitute “v1agra” or “cokc” and get your evil missives through. How can a list of disallowed words possibly guard against hapax legomena?
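To make the spammer’s reasoning concrete, here is a minimal sketch of a word-list filter. The list contents and the function name are made up for illustration; no real mail filter is being quoted here.

```python
import re

# Hypothetical blocklist, for illustration only.
BLOCKED_WORDS = {"viagra", "cialis"}

def blocked_by_word_list(subject):
    """Return True if any word in the subject line is on the blocklist."""
    words = re.findall(r"[a-z0-9]+", subject.lower())
    return any(word in BLOCKED_WORDS for word in words)

print(blocked_by_word_list("Cheap Viagra here"))  # the listed spelling is caught
print(blocked_by_word_list("Cheap v1agra here"))  # the never-seen spelling slips through
```

A list like this can only ever contain spellings somebody has already seen, which is exactly the hole the spammer is counting on.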

It turns out, though, that there are better ways to filter than against obvious “spammy” words like viagra and cialis. Paul Graham, in A Plan for Spam, wrote:

I think it’s possible to stop spam, and that content-based filters are the way to do it. The Achilles heel of the spammers is their message. They can circumvent any other barrier you set up. They have so far, at least. But they have to deliver their message, whatever it is. If we can write software that recognizes their messages, there is no way they can get around that.

Using a pseudo-Bayesian probabilistic approach, Graham’s plan calls for a user to train the filter by classifying each message in a corpus of received mail as spam or not-spam. Tim Peters, who worked on SpamBayes, the Python implementation of this approach, took to calling the not-spam “ham”, and the name seems to have stuck in the anti-spam community. The filter breaks each message apart into words (defined in this case as any run of text delimited by whitespace or punctuation) and then ranks the words by their spamminess and hamminess (the extent to which the mere presence of the word in a message is a good predictor of whether the message is spam or ham). A weighted aggregate score is computed over all the words in the message, and the filter classifies it as spam, ham, or not-sure (roughly equal ham and spam scores).

Because spammers need to communicate, and in particular to get you to visit a web page or click on a link so they can sell you stuff, for any given person certain words are found in almost all spam messages but in almost no real messages (e.g. “cheap” and “click”, or words with numbers in them). Words that are commonly found in both types of messages, such as your name, or articles and prepositions, end up with a middling score that basically doesn’t change the final result.

The Graham approach has proved to be remarkably accurate, often getting no false positives or false negatives after only a week or two of training. I don’t think anyone has reported never getting any unsures, no matter how much training is done, but that’s to be expected. Most spam, when examined statistically over the whole body of the message, leaves lots of clues to its spammy nature.
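Here is a minimal sketch of the train, rank and classify loop described above. It is a toy, not Graham’s or SpamBayes’s actual code: the class name, the training messages, the 0.01/0.99 clamp, the plain product used to combine word scores, and the two cutoffs are all assumptions chosen to keep the example short.

```python
import re
from collections import Counter

def tokenize(text):
    """Split text into the set of lowercase words it contains."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

class ToyFilter:
    """Toy presence-based filter; every detail here is an illustrative assumption."""

    def __init__(self):
        self.spam_msgs = 0
        self.ham_msgs = 0
        self.spam_counts = Counter()  # number of spam messages containing each word
        self.ham_counts = Counter()   # number of ham messages containing each word

    def train(self, text, is_spam):
        tokens = tokenize(text)
        if is_spam:
            self.spam_msgs += 1
            self.spam_counts.update(tokens)
        else:
            self.ham_msgs += 1
            self.ham_counts.update(tokens)

    def spamminess(self, word):
        """Estimate how strongly the word's presence predicts spam (0..1)."""
        s = self.spam_counts[word] / max(self.spam_msgs, 1)
        h = self.ham_counts[word] / max(self.ham_msgs, 1)
        if s + h == 0:
            return 0.5  # never seen in training: neutral
        # Clamp so one word can never force the score to exactly 0 or 1.
        return min(0.99, max(0.01, s / (s + h)))

    def score(self, text):
        """Combine the per-word scores into one message score (0..1)."""
        spam_like = 1.0
        ham_like = 1.0
        for word in tokenize(text):
            p = self.spamminess(word)
            spam_like *= p
            ham_like *= 1.0 - p
        # A real filter would guard against underflow on long messages.
        return spam_like / (spam_like + ham_like)

    def classify(self, text, ham_cutoff=0.2, spam_cutoff=0.9):
        s = self.score(text)
        if s >= spam_cutoff:
            return "spam"
        if s <= ham_cutoff:
            return "ham"
        return "unsure"

toy = ToyFilter()
toy.train("Cheap pills click here now", is_spam=True)
toy.train("Click here for cheap watches", is_spam=True)
toy.train("Lunch on Friday with Ignatz", is_spam=False)
toy.train("Here are the notes from the Friday meeting", is_spam=False)

print(toy.classify("Click for cheap pills"))    # spam
print(toy.classify("Notes from Friday lunch"))  # ham
print(toy.classify("Zebra quagga"))             # unsure
```

Note that “here” turns up in both kinds of training message, so it lands at a middling score, just like the names and prepositions mentioned above.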

So where do hapax legomena come in with Bayesian spam filtering? They don’t. One of the truly nifty things about the scheme is that, by definition, any word that has never been seen before counts as 0.5: neither spammy nor hammy, with no effect on the ultimate rating of the message. So the spammer’s trick of making a visually similar hapax legomenon is foiled (as is the other trick of padding the message with words unrelated to the spam to try to lower the score: unless the words happen to be hammy for your particular corpus, they won’t budge the score at all). But as soon as the message is classified one way or the other, whether by other clues in the message or by the user if the message rates an unsure, training on it turns that hapax legomenon into a clue for the type of message it is. So v1agra becomes at least a mild clue for spam, while once your uncle Ignatz writes you, his name becomes at least a mild clue that the message is ham.
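A quick sketch of why a 0.5 word is invisible and why one round of training changes that. Both functions are illustrative assumptions: `combine` is a plain naive-Bayes-style product, and `word_probability` shrinks a raw count ratio toward 0.5 with a made-up `strength` of 1.0. Neither is claimed to match what any particular filter does.

```python
from math import isclose, prod

def combine(probabilities):
    """Combine per-word spam probabilities into one message score."""
    spam_like = prod(probabilities)
    ham_like = prod(1.0 - p for p in probabilities)
    return spam_like / (spam_like + ham_like)

def word_probability(spam_count, ham_count, unknown=0.5, strength=1.0):
    """Raw spam ratio for a word, pulled toward `unknown` when evidence is thin."""
    n = spam_count + ham_count
    if n == 0:
        return unknown  # never seen: neutral
    raw = spam_count / n
    return (strength * unknown + n * raw) / (strength + n)

known = [0.9, 0.2, 0.7]             # scores of words the filter has seen before
with_hapaxes = known + [0.5, 0.5]   # same message plus two never-seen words

# Each 0.5 multiplies the spam side and the ham side by the same factor,
# so the ratio between them does not move.
print(isclose(combine(known), combine(with_hapaxes)))  # True

print(word_probability(0, 0))  # 0.5  (v1agra before training)
print(word_probability(1, 0))  # 0.75 (v1agra after one spam is trained)
print(word_probability(0, 1))  # 0.25 (Ignatz after one ham is trained)
```

One sighting moves the word off dead center but not all the way to certainty, which is the “at least a mild clue” behavior.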

The SpamBayes Project has a nice discussion of how this all works.

Thursday, October 28th, 2004

Hapax Legomenon

A hapax legomenon is a word or phrase that occurs only once in a given corpus (usually an entire language, but sometimes a particular text or the work of a particular author). They are often found in dead languages, but my friend badger may have found one in Spanish: parracial.

It apparently occurs in a poem by Pablo Neruda:

La parracial rosa devora
y sube a la cima del santo:
con espesas garras sujeta
el tiempo al fatigado ser:
hincha y sopla en las venas duras,
ata el cordel pulmonar, entonces
largamente escucha y respira.

It doesn’t appear in any of the Spanish dictionaries that she consulted (or in any of the online ones that I looked at), and a Google search turns up 7 hits: 2 hits to her blog mentioning her search, 2 hits to another blog referring to her blog, 1 hit to the poem itself, and 2 to an essay about Neruda.
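For a corpus small enough to hold in memory, the definition at the top of this post comes down to a word count. Here is a minimal sketch; the function name and the toy corpus are made up for illustration, and a real search over “all of Spanish” is obviously a different matter.

```python
import re
from collections import Counter

def hapax_legomena(corpus):
    """Return, in sorted order, the words that occur exactly once in the corpus."""
    # [^\W\d_]+ matches runs of letters, including accented ones.
    words = re.findall(r"[^\W\d_]+", corpus.lower())
    counts = Counter(words)
    return sorted(word for word, count in counts.items() if count == 1)

# Toy corpus, not a real text.
print(hapax_legomena("la rosa devora la rosa parracial"))
# ['devora', 'parracial']
```

Whether a word counts as a hapax depends entirely on what you feed in as the corpus: “devora” is one here only because the sample is six words long.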

Thursday, October 28th, 2004

The Language Museum

The Language Museum is an interesting little site that attempts to realize, as a website, the Language Museum proposed in Bodmer’s _The Loom of Language_. There you can look at word lists of various languages, organized by family, side by side (with English translations).

Thursday, October 28th, 2004