Machine Translation

As a programmer, and moreover one who has made a stab at a program to inflect Latin words, I’m naturally quite interested in the field of machine translation. It’s one of those things, like playing chess, that just seems as if it ought to be tricky but doable, at least until you really start trying to do it. Also like chess, it seems that some of the most promising approaches, in terms of building a successful machine rather than gaining a deeper understanding of how humans do it, take as much advantage as possible of what machines are good at doing: crunching numbers.

According to the old (possibly apocryphal) story, one of the first machine translation programs translated the saying “The spirit is willing, but the flesh is weak” to Russian, and then back as “The meat is good, but the vodka is rotten.”

Today, BabelFish gives us:
Дух охотно готов, но плоть слаба

Spirit is willingly ready, but flesh is weak

(I can’t read the Cyrillic, but the English isn’t bad at all compared to the story.)

Language Weaver represents an interesting approach to machine translation, one that departs radically from attempting to program a sophisticated model of the language’s underlying grammar. Following pioneering work by IBM and certain Japanese groups, Language Weaver starts with a large corpus of texts in the source language that have already been translated by humans into the target language, and then crunches numbers to produce a set of probabilities mapping runs of words in the one to the corresponding runs in the other (the following is from the Language Weaver web site):

The USC/ISI research team led by Dr. Kevin Knight and Dr. Daniel Marcu has developed a new, statistical/cryptographic approach to the automatic translation of human languages. In contrast to current commercial machine translation systems, the statistical translation system uses techniques that automatically learn how texts can be translated from one language into another. All that Language Weaver’s statistics-based translation engine requires for “learning” is a large collection of sentence pairs that are mutual translations of each other. Language Weaver learns the translation patterns for every word and phrase in the training data. It can then use those patterns to translate new text of the same type.

The statistical basis of the translation engine, and its potential for commercial success, are analogous to the technology behind today’s commercial speech recognition systems. We believe that statistics-based automatic translation will be the breakout product in the automatic translation market just as it has been in speech recognition. These advances will change the nature of translation, and are a decisive step toward pervasive real-time conversion of textual information between languages.
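To make the "learning from sentence pairs" idea a little more concrete, here is a minimal sketch of the simplest word-level model from the IBM line of work (usually called IBM Model 1), trained with expectation-maximization on a toy German-English corpus. This is not Language Weaver's engine, which works with phrases and far more data; the function names `train_ibm_model1` and `best_translation` and the three sentence pairs in `TOY_PAIRS` are invented here for illustration.

```python
def train_ibm_model1(pairs, iterations=10):
    """Estimate word-translation probabilities t[(target, source)] with EM.

    `pairs` is a list of (source_sentence, target_sentence) strings that are
    translations of each other. Tokenization is plain whitespace splitting,
    and there is no NULL word, so this is a simplified sketch of Model 1.
    """
    corpus = [(src.split(), tgt.split()) for src, tgt in pairs]
    tgt_vocab = {w for _, tgt in corpus for w in tgt}

    # Start with a uniform guess for every word pair that co-occurs.
    uniform = 1.0 / len(tgt_vocab)
    t = {}
    for src, tgt in corpus:
        for tw in tgt:
            for sw in src:
                t[(tw, sw)] = uniform

    for _ in range(iterations):
        count = {}  # expected number of times sw is translated as tw
        total = {}  # expected number of times sw is translated at all

        # E-step: share each target word among the source words in its
        # sentence pair, in proportion to the current probabilities.
        for src, tgt in corpus:
            for tw in tgt:
                norm = sum(t[(tw, sw)] for sw in src)
                for sw in src:
                    frac = t[(tw, sw)] / norm
                    count[(tw, sw)] = count.get((tw, sw), 0.0) + frac
                    total[sw] = total.get(sw, 0.0) + frac

        # M-step: re-estimate the probabilities from the expected counts.
        for (tw, sw), c in count.items():
            t[(tw, sw)] = c / total[sw]

    return t


def best_translation(t, src_word):
    """Return the target word with the highest probability for src_word."""
    candidates = {tw: p for (tw, sw), p in t.items() if sw == src_word}
    return max(candidates, key=candidates.get)


# Made-up toy corpus: (German source, English target).
TOY_PAIRS = [
    ("das Haus", "the house"),
    ("das Buch", "the book"),
    ("ein Buch", "a book"),
]

if __name__ == "__main__":
    table = train_ibm_model1(TOY_PAIRS)
    for word in ["das", "Haus", "Buch", "ein"]:
        print(word, "->", best_translation(table, word))
```

Nothing in the code knows any German or English grammar. The only signal is which words keep showing up together across sentence pairs, which is why the approach lives or dies by the size of the parallel corpus.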

The statistical approach harkens back to an idea I’d run across before, which apparently originated with Warren Weaver: treating a foreign language as a code to break, except now using modern cryptanalysis methods and computing power.

“One naturally wonders if the problem of translation could conceivably be treated as a problem in cryptography. When I look at an article in Russian, I say: ‘This is really written in English, but it has been coded in some strange symbols. I will now proceed to decode.’”
- Warren Weaver, March 1947 (another interesting tidbit from their website, also referenced here)
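One common way to turn Weaver's "decode it" intuition into arithmetic is the so-called noisy-channel formulation used in statistical MT: among candidate English sentences e for a foreign sentence f, pick the one that maximizes P(e) × P(f | e), where P(e) says how plausible the English is and P(f | e) says how well it accounts for the foreign text. The sketch below assumes that formulation. The function name `decode` and the two candidates with their probabilities are made up purely for illustration and do not come from any real system.

```python
import math


def decode(candidates, lm_prob, tm_prob):
    """Pick the candidate e that maximizes P(e) * P(f | e).

    lm_prob[e] plays the role of the language model P(e) and tm_prob[e] the
    translation model P(f | e) for one fixed foreign sentence f. Scores are
    added in log space to avoid underflow on longer sentences.
    """
    def score(e):
        return math.log(lm_prob[e]) + math.log(tm_prob[e])

    return max(candidates, key=score)


# Hypothetical numbers, invented for this example only.
LM = {"the spirit is willing": 0.004, "the vodka is good": 0.001}
TM = {"the spirit is willing": 0.3, "the vodka is good": 0.4}

if __name__ == "__main__":
    print(decode(list(LM), LM, TM))
```

With these invented numbers the translation model alone slightly prefers the vodka reading, and it is the language model's opinion about what plausible English looks like that tips the decision the other way.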

2 Responses to “Machine Translation”

  1. Peter Says:

    Actually, statistical machine translation has been around since the 80s. The problem is that there is usually no large
    amount of parallel text in two languages available, and if there is, it is usually not in electronic form. In bilingual
    countries like Canada or Belgium, there are parallel legal texts and parliament proceedings, and the European
    Union also has a lot of those, but it’s very hard to find general text.

  2. Joshua Macy Says:

    Right, I didn’t mean to make Language Weaver sound revolutionary. The page I pointed to about the history of MT talked about IBM’s Candide system, which was initiated in the ’80s. Still, LW is the first system to escape the laboratory that I’ve heard of, which I think is interesting.
