Archive for June 7th, 2004

Machine Translation

As a programmer, and moreover one who has made a stab at a program to inflect Latin words, I’m naturally quite interested in the field of machine translation. It’s one of those things, like playing Chess, that just seems as if it ought to be tricky but do-able, at least until you really start trying to do it. Also like Chess, it seems that some of the most promising approaches, in terms of building a successful machine rather than gaining a deeper understanding of how humans do it, take as much advantage as possible of what machines are good at doing: crunching numbers.

According to the old (possibly apocryphal) story, one of the first machine translation programs translated the saying “The spirit is willing, but the flesh is weak” to Russian, and then back as “The meat is good, but the vodka is rotten.”

Today, BabelFish gives us:
Дух охотно готов, но плоть слаба

Spirit is willingly ready, but flesh is weak

(I can’t read the Cyrillic at all, but the English isn’t bad compared to the story.)

Language Weaver represents an interesting approach to machine translation, one that departs radically from attempting to program a sophisticated model of the language’s underlying grammar. Following pioneering work by IBM and certain Japanese groups, Language Weaver’s approach starts with a large corpus of texts in the source language that have already been translated by humans into the target language, and then crunches numbers to produce a set of probabilities mapping runs of words in the one to the corresponding runs in the other (the following is from the Language Weaver web site):

The USC/ISI research team led by Dr. Kevin Knight and Dr. Daniel Marcu has developed a new, statistical/cryptographic approach to the automatic translation of human languages. In contrast to current commercial machine translation systems, the statistical translation system uses techniques that automatically learn how texts can be translated from one language into another. All that Language Weaver’s statistics-based translation engine requires for “learning” is a large collection of sentence pairs that are mutual translations of each other. Language Weaver learns the translation patterns for every word and phrase in the training data. It can then use those patterns to translate new text of the same type.

The statistical basis of the translation engine, and its potential for commercial success, are analogous to the technology behind today’s commercial speech recognition systems. We believe that statistics-based automatic translation will be the breakout product in the automatic translation market just as it has been in speech recognition. These advances will change the nature of translation, and are a decisive step toward pervasive real-time conversion of textual information between languages.

This harks back to an idea I’d run across before, which apparently originated with Warren Weaver: treating a foreign language as a code to break, except now using modern cryptanalysis methods and computing power.

“One naturally wonders if the problem of translation could conceivably be treated as a problem in cryptography. When I look at an article in Russian, I say: ‘This is really written in English, but it has been coded in some strange symbols. I will now proceed to decode.’”
- Warren Weaver, March 1947 (another interesting tidbit from their website, also referenced here)
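The "crunch numbers over sentence pairs" idea can be sketched in a few lines. What follows is only a toy in the spirit of the early IBM word-based models (a simplified Model 1 style expectation-maximization loop, with no NULL word and no phrases), not Language Weaver's actual system. The three-sentence German/English corpus and the names `train` and `best_source_word` are made up for illustration.

```python
from collections import defaultdict

def train(pairs, iterations=20):
    """Estimate t[(f, e)], roughly P(foreign word f | English word e),
    from sentence pairs alone, using a simplified IBM Model 1 style EM loop."""
    foreign_vocab = {f for foreign, _ in pairs for f in foreign}
    # Start with no knowledge: every foreign word is equally likely.
    t = defaultdict(lambda: 1.0 / len(foreign_vocab))
    for _ in range(iterations):
        count = defaultdict(float)   # expected (f, e) co-translations
        total = defaultdict(float)   # expected uses of each English word
        for foreign, english in pairs:
            for f in foreign:
                # Share one unit of credit for f among the English words
                # in the same sentence, in proportion to the current t.
                norm = sum(t[(f, e)] for e in english)
                for e in english:
                    frac = t[(f, e)] / norm
                    count[(f, e)] += frac
                    total[e] += frac
        # Re-estimate the probabilities from the expected counts.
        for (f, e), c in count.items():
            t[(f, e)] = c / total[e]
    return t

def best_source_word(t, english_word, foreign_vocab):
    """Pick the foreign word the table considers most likely for english_word."""
    return max(foreign_vocab, key=lambda f: t[(f, english_word)])

# Hypothetical toy corpus: (foreign sentence, human translation).
corpus = [
    ("das haus".split(), "the house".split()),
    ("das buch".split(), "the book".split()),
    ("ein buch".split(), "a book".split()),
]
table = train(corpus)
vocab = {"das", "haus", "buch", "ein"}
print({e: best_source_word(table, e, vocab) for e in ["the", "house", "book", "a"]})
```

Nobody tells the program that "das" means "the". It works that out because "das" keeps turning up wherever "the" does, which is very much the codebreaker's way of going about it.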

Monday, June 7th, 2004

Martialis

Martialis is a blog devoted to the epigrams of Martial. Is that cool, or what?

Thanks to languagehat for the link

Monday, June 7th, 2004

Tension, Apprehension, and Dissension have begun

Tenser, said the Tensor is a cool language blog that I just discovered by following back blogs that have linked to logomacy. Reading the author’s “about me” section makes me wonder whether he’s my Mirror Universe twin. Which means that since I don’t have a goatee…

Monday, June 7th, 2004

A man after my own heart

As you are reading these words you are taking part in one of the wonders of the natural world. For you and I belong to a species with a remarkable ability: we can shape events in each other’s brains with exquisite precision. I am not referring to telepathy or mind control or the other obsessions of fringe science; even in the depictions of believers these are blunt instruments compared to an ability that is uncontroversially present in every one of us. That ability is language. Simply by making noises with our mouths, we can reliably cause precise new combinations of ideas to arise in each other’s minds. The ability comes so naturally that we are apt to forget what a miracle it is. - Steven Pinker, The Language Instinct

Neat! It particularly resonates with me, since I’ve said much the same thing myself, both on the opening page of this site and in this piece on http://www.webamused.com (including the comparison to claims of telepathy).

Monday, June 7th, 2004

Unfortunate algorithms

Atom is a format designed to provide “feeds” from blogs–short excerpts and headlines pointing back to the original posts. (RSS is another, competing, format.) When it comes time to actually program the software to produce the feed from the blog post, one of the decisions the programmer has to make is where to cut off longer headlines. In the case of this post from OxBlog, the programmer apparently chose to just chop it as soon as some character limit had been reached, leading to the following headline in the feed:

“THANKS ANYWAY, I THINK I’LL OPT FOR TELLY AND A HO…”

Of course, despite the popularity of Rap music in Britain^1^, there are probably comparatively few English speakers who would say both “telly” for television and “ho” for ladies of negotiable affection, but it still gave me pause.

The actual headline, by the way, is “THANKS ANYWAY, I THINK I’LL OPT FOR TELLY AND A HOT-WATER BOTTLE.”

How much harder would it have been to program the feed to either proceed to the next word break, or backtrack to the nearest prior one? It seems particularly silly since the program adds the ellipsis, and those three characters would have been more than enough to complete the word.
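Not much harder at all, as far as I can tell. Here is a minimal sketch of both options. The function names are mine, and the 50-character limit is an assumption inferred from counting the characters in the chopped headline; I have no idea what the actual feed software looks like.

```python
def truncate_back(text, limit, ellipsis="…"):
    """Cut at the last word break at or before `limit` characters."""
    if len(text) <= limit:
        return text
    # Look for the last space among the first limit + 1 characters, so a
    # space sitting exactly at the limit still counts as a word break.
    cut = text.rfind(" ", 0, limit + 1)
    if cut <= 0:
        # No word break to fall back on: chop mid-word after all.
        return text[:limit] + ellipsis
    return text[:cut].rstrip() + ellipsis

def truncate_forward(text, limit, ellipsis="…"):
    """Run on past `limit` to the next word break."""
    if len(text) <= limit:
        return text
    cut = text.find(" ", limit)
    if cut == -1:
        # The last word runs to the end, so nothing needs trimming.
        return text
    return text[:cut] + ellipsis

headline = "THANKS ANYWAY, I THINK I'LL OPT FOR TELLY AND A HOT-WATER BOTTLE."
print(truncate_back(headline, 50))     # ends with "...TELLY AND A…"
print(truncate_forward(headline, 50))  # ends with "...A HOT-WATER…"
```

Backtracking never exceeds the limit, which matters if the limit is a hard one. Running forward keeps more of the headline but can overshoot by the length of one word.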

1 - at least popular enough for British Home Secretary David Blunkett and Culture Minister Kim Howells to denounce it last year.

Monday, June 7th, 2004

Clintonius Maximus

MemeFirst - Clintonius Maximus

Regnum Clintoni benignus erit. In rebus domesticis deficit reductio fecit, Securitatis Socialis salvatus, gigantum mercatus taurus supervisus est, ‘novus economicus’ salutavit. Simul, in rebus internationalismus bonus erit. Multilateralismus conducit. Cum amico intimo Antonius Blair, primus ministrum britannicus, Viam Tertiam creavit.

Sed Eheu! Magnum disastrum suscepit sua maxima culpa. Per noctem, Novembre MCMLXXXXV Alia Occidentalis Domus Albus laborante, sibi pizza donata est a Monica Lewinsky, puella pulchrissima, sensuosa californicante, fellatrix superiore.

Omnia res lege

Monday, June 7th, 2004