Machine Translation
As a programmer, and one who has taken a stab at writing a program to inflect Latin words, I’m naturally quite interested in the field of machine translation. It’s one of those things, like playing chess, that seems as if it ought to be tricky but doable, at least until you really start trying to do it. Also like chess, it seems that some of the most promising approaches, in terms of building a successful machine rather than gaining a deeper understanding of how humans do it, take as much advantage as possible of what machines are good at doing: crunching numbers.
According to the old (possibly apocryphal) story, one of the first machine translation programs translated the saying “The spirit is willing, but the flesh is weak” to Russian, and then back as “The vodka is good, but the meat is rotten.”
Today, BabelFish gives us:
Дух охотно готов, но плоть слаба
Spirit is willingly ready, but flesh is weak
(I can’t read Cyrillic at all, but the English isn’t bad compared to the story.)
Language Weaver represents an interesting approach to machine translation, one that departs radically from attempting to program a sophisticated model of the language’s underlying grammar. Following pioneering work by IBM and certain Japanese groups, Language Weaver’s approach starts with a large corpus of texts in the source language that have already been translated by humans into the target language…and then crunches numbers to produce a set of probabilities mapping runs of words in the one to the corresponding runs in the other (the following is from the Language Weaver web site):
The USC/ISI research team led by Dr. Kevin Knight and Dr. Daniel Marcu has developed a new, statistical/cryptographic approach to the automatic translation of human languages. In contrast to current commercial machine translation systems, the statistical translation system uses techniques that automatically learn how texts can be translated from one language into another. All that Language Weaver’s statistics-based translation engine requires for “learning” is a large collection of sentence pairs that are mutual translations of each other. Language Weaver learns the translation patterns for every word and phrase in the training data. It can then use those patterns to translate new text of the same type.
The statistical basis of the translation engine, and its potential for commercial success, are analogous to the technology behind today’s commercial speech recognition systems. We believe that statistics-based automatic translation will be the breakout product in the automatic translation market just as it has been in speech recognition. These advances will change the nature of translation, and are a decisive step toward pervasive real-time conversion of textual information between languages.
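To make the "learning from sentence pairs" idea concrete, here is a minimal sketch of the simplest of IBM's word-based models (usually called IBM Model 1), trained with expectation-maximization on a toy corpus. This is not Language Weaver's engine, which I haven't seen and which works on phrases as well as words; the function names, the three-sentence German/English corpus, and the iteration count are all made up for illustration.

```python
from collections import defaultdict


def train_ibm_model1(pairs, iterations=20):
    """Estimate t(target_word | source_word) from sentence pairs using EM.

    `pairs` is a list of (source_sentence, target_sentence) strings.
    Illustrative only: no NULL word, no smoothing, whitespace tokenization.
    """
    corpus = [(src.split(), tgt.split()) for src, tgt in pairs]
    target_vocab = {w for _, tgt in corpus for w in tgt}

    # Start with no knowledge: every target word is equally likely.
    uniform = 1.0 / len(target_vocab)
    t = defaultdict(lambda: uniform)  # keyed by (target_word, source_word)

    for _ in range(iterations):
        count = defaultdict(float)  # expected (target, source) co-alignments
        total = defaultdict(float)  # expected alignments per source word

        # E-step: share each target word out among the source words in its
        # sentence, in proportion to the current probabilities.
        for src, tgt in corpus:
            for tw in tgt:
                norm = sum(t[(tw, sw)] for sw in src)
                for sw in src:
                    fraction = t[(tw, sw)] / norm
                    count[(tw, sw)] += fraction
                    total[sw] += fraction

        # M-step: renormalize the expected counts into probabilities.
        t = defaultdict(
            float,
            {(tw, sw): c / total[sw] for (tw, sw), c in count.items()},
        )
    return t


def best_translation(t, source_word):
    """Return the target word with the highest learned probability."""
    candidates = {tw: p for (tw, sw), p in t.items() if sw == source_word}
    return max(candidates, key=candidates.get)


if __name__ == "__main__":
    # Hypothetical toy corpus; a real system needs vastly more text.
    toy_pairs = [
        ("das Haus", "the house"),
        ("das Buch", "the book"),
        ("ein Buch", "a book"),
    ]
    table = train_ibm_model1(toy_pairs)
    for word in ["das", "Haus", "Buch", "ein"]:
        print(word, "->", best_translation(table, word))
```

Nothing in the code knows any German or English. The only reason "Buch" ends up paired with "book" is that the two keep turning up in the same sentence pairs while their neighbours change, which is the number crunching in a nutshell.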
This harks back to an idea I’d run across before, which apparently originated with Warren Weaver: treat a foreign language as a code to break, only now with modern cryptanalysis methods and computing power.
“One naturally wonders if the problem of translation could conceivably be treated as a problem in cryptography. When I look at an article in Russian, I say: ‘This is really written in English, but it has been coded in some strange symbols. I will now proceed to decode.’”
- Warren Weaver, March 1947 (another interesting tidbit from their website, also referenced here)
Monday, June 7th, 2004