Archive for June, 2004

Machine Translation

As a programmer, and moreover one who has made a stab at a program to inflect Latin words, I’m naturally quite interested in the field of machine translation. It’s one of those things, like playing chess, that seems as if it ought to be tricky but doable, at least until you really start trying to do it. Also as with chess, some of the most promising approaches, in terms of building a successful machine rather than gaining a deeper understanding of how humans do it, take as much advantage as possible of what machines are good at: crunching numbers.

According to the old (possibly apocryphal) story, one of the first machine translation programs translated the saying “The spirit is willing, but the flesh is weak” to Russian, and then back as “The meat is good, but the vodka is rotten.”

Today, BabelFish gives us:
Дух охотно готов, но плоть слаба

Spirit is willingly ready, but flesh is weak

(I can’t read the Cyrillic at all, but the English isn’t bad at all compared to the story.)

Language Weaver represents an interesting approach to machine translation that departs radically from attempting to program a sophisticated model of the language’s underlying grammar. Following pioneering work by IBM and certain Japanese groups, Language Weaver starts with a large corpus of texts in the source language that have already been translated by humans into the target language, and then crunches numbers to produce a set of probabilities mapping runs of words in the one to the corresponding runs in the other (the following is from the Language Weaver web site):

The USC/ISI research team led by Dr. Kevin Knight and Dr. Daniel Marcu has developed a new, statistical/cryptographic approach to the automatic translation of human languages. In contrast to current commercial machine translation systems, the statistical translation system uses techniques that automatically learn how texts can be translated from one language into another. All that Language Weaver’s statistics-based translation engine requires for “learning” is a large collection of sentence pairs that are mutual translations of each other. Language Weaver learns the translation patterns for every word and phrase in the training data. It can then use those patterns to translate new text of the same type.

The statistical basis of the translation engine, and its potential for commercial success, are analogous to the technology behind today’s commercial speech recognition systems. We believe that statistics-based automatic translation will be the breakout product in the automatic translation market just as it has been in speech recognition. These advances will change the nature of translation, and are a decisive step toward pervasive real-time conversion of textual information between languages.

This harkens back to an idea I’d run across before, which apparently originated with Warren Weaver: treating a foreign language as a code to break, except now using modern cryptanalysis methods and computing power.

“One naturally wonders if the problem of translation could conceivably be treated as a problem in cryptography. When I look at an article in Russian, I say: ‘This is really written in English, but it has been coded in some strange symbols. I will now proceed to decode.’”
- Warren Weaver, March 1947 (another interesting tidbit from their website, also referenced here)
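
To give a feel for what “learning translation patterns” from sentence pairs can mean, here is a minimal sketch of the textbook word-alignment technique known as IBM Model 1, trained with expectation-maximization on a three-sentence toy corpus. It is not Language Weaver’s engine, whose internals aren’t described in the quote above; the German-English corpus, the function names `train_model1` and `best_translation`, and the iteration count are all made up for illustration.

```python
from collections import defaultdict

# Toy parallel corpus (an assumption for illustration): (source words, target words).
CORPUS = [
    ("das Haus".split(), "the house".split()),
    ("das Buch".split(), "the book".split()),
    ("ein Buch".split(), "a book".split()),
]


def train_model1(pairs, iterations=20):
    """Estimate t(target_word | source_word) by EM, in the style of IBM Model 1.

    Simplified sketch: no NULL word, no smoothing, no sentence-length model.
    """
    target_vocab = {w for _, tgt in pairs for w in tgt}
    # Start from a uniform distribution over target words.
    uniform = 1.0 / len(target_vocab)
    t = defaultdict(lambda: uniform)  # keyed by (target_word, source_word)

    for _ in range(iterations):
        count = defaultdict(float)  # expected co-translation counts
        total = defaultdict(float)  # expected counts per source word
        for src, tgt in pairs:
            for w in tgt:
                # E-step: share one unit of credit for w among the source words.
                norm = sum(t[(w, s)] for s in src)
                for s in src:
                    share = t[(w, s)] / norm
                    count[(w, s)] += share
                    total[s] += share
        # M-step: renormalize the expected counts into probabilities.
        t = defaultdict(float, {(w, s): c / total[s] for (w, s), c in count.items()})
    return t


def best_translation(t, source_word):
    """Return the target word with the highest learned probability."""
    candidates = {w: p for (w, s), p in t.items() if s == source_word}
    return max(candidates, key=candidates.get)


table = train_model1(CORPUS)
for word in ["das", "Haus", "Buch", "ein"]:
    print(word, "->", best_translation(table, word))
```

Nothing in the program knows any German or English. The pairing of “Buch” with “book” falls out purely from which words keep turning up in sentences that are translations of each other, which is the sense in which the approach is statistical rather than grammatical.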

Monday, June 7th, 2004

Martialis

Martialis is a blog devoted to the epigrams of Martial. Is that cool, or what?

Thanks to languagehat for the link

Monday, June 7th, 2004

Tension, Apprehension, and Dissension have begun

Tenser, said the Tensor is a cool language blog that I just discovered by following back blogs that have linked to logomacy. Reading the author’s about me section makes me wonder whether he’s my Mirror Universe twin. Which means that since I don’t have a goatee…

Monday, June 7th, 2004

A man after my own heart

As you are reading these words you are taking part in one of the wonders of the natural world. For you and I belong to a species with a remarkable ability: we can shape events in each other’s brains with exquisite precision. I am not referring to telepathy or mind control or the other obsessions of fringe science; even in the depictions of believers these are blunt instruments compared to an ability that is uncontroversially present in every one of us. That ability is language. Simply by making noises with our mouths, we can reliably cause precise new combinations of ideas to arise in each other’s minds. The ability comes so naturally that we are apt to forget what a miracle it is. – Steven Pinker, The Language Instinct

Neat! It particularly resonates with me, since I’ve said much the same thing myself, both on the opening page of this site and in this piece on http://www.webamused.com (including the comparison to claims of telepathy).

Monday, June 7th, 2004

Unfortunate algorithms

Atom is a format designed to provide “feeds” from blogs: short excerpts and headlines pointing back to the original posts. (RSS is another, competing, format.) When it comes time to actually program the software to produce the feed from the blog post, one of the decisions the programmer has to make is where to cut off longer headlines. In the case of this post from OxBlog, the programmer apparently chose to just chop it as soon as some character limit had been reached, leading to the following headline in the feed:

“THANKS ANYWAY, I THINK I’LL OPT FOR TELLY AND A HO…”

Of course, despite the popularity of Rap music in Britain^1^, there are probably comparatively few English speakers who would say both “telly” for television and “ho” for ladies of negotiable affection, but it still gave me pause.

The actual headline, by the way, is “THANKS ANYWAY, I THINK I’LL OPT FOR TELLY AND A HOT-WATER BOTTLE.”

How much harder would it have been to program the feed to either proceed to the next word break, or backtrack to the nearest prior one? It seems particularly silly since the program adds the ellipsis, and those three characters would have been more than enough to complete the word.
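
Not much harder, as it turns out. Here is a minimal sketch of both options; the function name `truncate_at_word`, the `forward` flag, and the 50-character limit are my own assumptions for illustration, not anything taken from the actual feed software.

```python
def truncate_at_word(headline, limit, forward=False, ellipsis="..."):
    """Shorten a headline near `limit` characters without splitting a word.

    forward=False backtracks to the nearest prior word break;
    forward=True proceeds to the next one. Only spaces count as breaks.
    """
    if len(headline) <= limit:
        return headline  # short enough already; no ellipsis needed

    if forward:
        cut = headline.find(" ", limit)
        if cut == -1:
            return headline  # the cut fell in the last word; keep it all
    else:
        # Search up to and including position `limit`, so a cut that
        # already lands on a space is left where it is.
        cut = headline.rfind(" ", 0, limit + 1)
        if cut == -1:
            cut = limit  # one enormous word: fall back to a hard chop

    return headline[:cut].rstrip() + ellipsis


HEADLINE = "THANKS ANYWAY, I THINK I'LL OPT FOR TELLY AND A HOT-WATER BOTTLE."

print(truncate_at_word(HEADLINE, 50))
print(truncate_at_word(HEADLINE, 50, forward=True))
```

With a limit of 50, which is where the feed’s version happens to break, backtracking ends the headline at “AND A...” and going forward ends it at “HOT-WATER...”. Either way the unfortunate fragment never appears.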

1 – At least popular enough for British Home Secretary David Blunkett and Culture Minister Kim Howells to denounce it last year.

Monday, June 7th, 2004

Clintonius Maximus

MemeFirst – Clintonius Maximus

Regnum Clintoni benignus erit. In rebus domesticis deficit reductio fecit, Securitatis Socialis salvatus, gigantum mercatus taurus supervisus est, ‘novus economicus’ salutavit. Simul, in rebus internationalismus bonus erit. Multilateralismus conducit. Cum amico intimo Antonius Blair, primus ministrum britannicus, Viam Tertiam creavit.

Sed Eheu! Magnum disastrum suscepit sua maxima culpa. Per noctem, Novembre MCMLXXXXV Alia Occidentalis Domus Albus laborante, sibi pizza donata est a Monica Lewinsky, puella pulchrissima, sensuosa californicante, fellatrix superiore.

Omnia res lege

Monday, June 7th, 2004

Me and Mrs. Malaprop

Sometimes whether something’s an eggcorn or just a mistake is a bit harder to tell (at least for me).
For instance, in free reign (Google ratio: 1.16!) the folk-etymology seems clear. On the other hand, rein of terror (Google ratio: 217) may just be a dropped-letter typo, along the lines of let lose (GR: 75).

Saturday, June 5th, 2004

English As She Is Spoke

What can two intrepid translators accomplish when they don’t know the target language and have no dictionary between it and their own language, but do have one dictionary from their language into a third language and another from that third language into the target? Well, they can serve as a warning to others. José da Fonseca and Pedro Carolino were the two Portuguese gentlemen who, in 1855, armed with a Portuguese-French and a French-English dictionary, produced a Portuguese-English phrasebook.
I’ve just picked up a nifty new edition of *English as She is Spoke*, as the result came to be known, and it certainly is something else.

Não podêmos ouvír nos.
Do not might one’s understand to speak.

Gásta-se múita lênha n’éssa cása.
One’s make us very much of the wood in that house there.

I’m not even sure what the intent of that phrase was.

On the other hand, I can’t wait to use
Quê negócio vó ôu ô demorôo?
What business has staced you?

As soon as I can figure out a suitable meaning for stace. I’m thinking something like, “This project is so staced.”

For an updated take on it, check out English As She Is Spoke vs. Babelfish!.
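
The two-dictionary method is easy enough to sketch in code, and the sketch shows where the trouble starts: every ambiguity in the middle language leaks through to the far side. The function name `pivot_translate` and the two tiny word lists are my own inventions for illustration, not entries from the dictionaries Fonseca and Carolino actually used.

```python
def pivot_translate(word, first, second):
    """Collect every target-language candidate reachable through the pivot language."""
    candidates = []
    for pivot in first.get(word, []):
        for target in second.get(pivot, []):
            if target not in candidates:
                candidates.append(target)
    return candidates


# Hypothetical toy dictionaries: Portuguese -> French and French -> English.
PT_FR = {"lenha": ["bois"]}
FR_EN = {"bois": ["wood", "woods", "antlers"]}

print(pivot_translate("lenha", PT_FR, FR_EN))
```

Someone who knows neither French nor English has no way to tell which of the candidates is the firewood, and that is before any grammar enters the picture.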

Saturday, June 5th, 2004

The Siege Of Belgrade

An Austrian army, awfully array’d,
Boldly by battery besiege Belgrade;
Cossack commanders cannonading come,
Deal devastation’s dire destructive doom;
Ev’ry endeavour engineers essay,
For fame, for freedom, fight, fierce furious fray.
Gen’rals ‘gainst gen’rals grapple–gracious God!
How honors Heav’n heroic hardihood!
Infuriate, indiscriminate in ill,
Just Jesus, instant innocence instill!
Kinsmen kill kinsmen, kindred kindred kill.
Labour low levels longest, loftiest lines;
Men march ‘midst mounds, motes, mountains, murd’rous mines.
Now noisy, noxious numbers notice nought,
Of outward obstacles o’ercoming ought;
Poor patriots perish, persecution’s pest!
Quite quiet Quakers “Quarter, quarter” quest;
Reason returns, religion, right, redounds,
Suwarrow stop such sanguinary sounds!
Truce to thee, Turkey, terror to thy train!
Unwise, unjust, unmerciful Ukraine!
Vanish vile vengeance, vanish victory vain!
Why wish we warfare? wherefore welcome won
Xerxes, Xantippus, Xavier, Xenophon?
Yield, ye young Yaghier yeomen, yield your yell!
Zimmerman’s, Zoroaster’s, Zeno’s zeal
Again attract; arts against arms appeal.
All, all ambitious aims, avaunt, away!
Et cætera, et cætera, et cætera.

Bartlett’s has the source as “Miscellaneous,” but at least two other sites attribute it to Alaric Alexander Watts (1797–1864) without quoting it in full. I first came across it when I was about eleven, in a book about puzzles and word play, and promptly committed it to memory. The version I learned was slightly different: it lacked any J’s (I’m pretty sure that it’s not just my faulty memory, since I recall the accompanying text mentioning that J was the only missing letter), and it had several other differences, e.g. the line for P being “Poor patriots, partly purchased, partly pressed.” The Bartlett’s text has a note, which suggests that the versions may have diverged quite early on:

These lines having been incorrectly printed in a London publication, we have been favoured by the author with an authentic copy of them. –Wheeler’s Magazine, vol. i. p. 244. (Winchester, England, 1828.)

Friday, June 4th, 2004

Searching for Eggcorns

I am Internally grateful to Language Log for introducing me to the concept of eggcorns.

Searching for eggcorns is indeed a hard road to hoe. Sometimes you need to “take another tact” and learn to “tow the line.”

Friday, June 4th, 2004
