by Tom Thompson
We can trace the origin of machine translation (MT) back to 17th-century ideas and some early efforts at mechanical dictionaries, but it was not until the 20th century and the advent of computer technology that marked progress was made. MT should now be called computer translation (CT).
Still, the whole effort has suffered from exaggerated claims and impossible expectations. Spoken language is too quick and fragmented for CT. Oral communication has too many false starts, and the technology has nothing to pick up tone of voice, cultural references, idiom and humor.
Monolinguals get excited about a public and free version of this or that translation service. But my own reaction is mostly a polite, multilingual groan, especially if I know either the source or the target language. Very simply, text is easier to translate than conversation, and it’s better suited to the technology.
While it’s a popular belief that CT for written text is making great strides, it’s more accurate to say that progress has been slow. A famous, probably apocryphal, tale tells of a CT effort that translated an English sentence into Russian and back again: “The spirit is willing but the flesh is weak” came back as “The vodka is good, but the meat is rotten.”
If you’re a devout believer in the future of CT, then your reaction is mostly a hollow grin. Achieving good CT is hard. The reasons are straightforward: a general property of all languages is the prevalence of ambiguity, both in individual words and in the relationships between parts of a sentence. In our first languages, we’re efficient at resolving ambiguities when interpreting linguistic input. But past experience and context are difficult to model in a computer program.
One CT approach has been based on linguistic rules. It relies on detailed information about both the source and target languages, using morphological and syntactic rules as well as semantic analysis of each.
A more recent development has been statistical machine translation (SMT), which has become the dominant framework of CT research. Statistical methods do not require researchers to know the languages their systems handle, and do not demand complex large-scale acquisition of rules and lexical data. The focus instead is on the growing availability of large monolingual and bilingual corpora. SMT relies on the notion that every language must describe a similar set of ideas, so the words that do this must also be similar. The trick is to develop and refine the so-called “language space,” which can be thought of as a set of vectors that each point from one word to another. It turns out that different languages share many similarities in this vector space, which means the process of converting one language into another often becomes partly mathematics.
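To make the “partly mathematics” point concrete, here is a minimal sketch in Python. It assumes, purely for illustration, that English and Spanish words have already been placed in one shared two-dimensional space; the vectors and the names `cosine` and `nearest_word` are invented for this example and do not come from any real system. Translation then amounts to finding the closest vector in the other language.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def nearest_word(vector, vocabulary):
    """Return the word in `vocabulary` whose vector is most similar to `vector`."""
    return max(vocabulary, key=lambda word: cosine(vector, vocabulary[word]))

# Hypothetical toy vectors, assumed to live in a shared space already.
english = {"cat": (0.9, 0.1), "dog": (0.7, 0.6), "house": (0.1, 0.95)}
spanish = {"gato": (0.88, 0.12), "perro": (0.72, 0.58), "casa": (0.12, 0.9)}

for word, vec in english.items():
    print(word, "->", nearest_word(vec, spanish))
```

Real systems work with hundreds of dimensions and have to learn the mapping between the two spaces from data; the toy numbers above skip that step entirely.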
Many researchers are adopting “hybrid” approaches that combine rules-based methods with statistical models. Google recently improved its internal translation capabilities by using nearly 200 billion words and phrases from United Nations and European Union materials to train its system. The Google model can learn the likelihood that “X” in language A will be translated as “Y” in language B. The theory is that the more data you feed in, the better the model’s statistical guesses get. These documents are full of legalese, but at least they don’t raise the hackles of copyright enforcers!
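The idea of learning how likely “X” is to be translated as “Y” can be sketched in a few lines of Python. This is not Google’s method, only a simple relative-frequency estimate, and it assumes the word pairs have already been aligned, which real systems must work out for themselves. The function name `translation_probabilities` and the French-English toy data are invented for illustration.

```python
from collections import Counter, defaultdict

def translation_probabilities(aligned_pairs):
    """Estimate P(target | source) by counting aligned (source, target) word pairs."""
    pair_counts = defaultdict(Counter)
    for source, target in aligned_pairs:
        pair_counts[source][target] += 1

    probabilities = {}
    for source, targets in pair_counts.items():
        total = sum(targets.values())
        probabilities[source] = {t: count / total for t, count in targets.items()}
    return probabilities

# Hypothetical toy data: (French word, English word) pairs, assumed pre-aligned.
toy_pairs = [
    ("maison", "house"),
    ("maison", "house"),
    ("maison", "home"),
    ("chat", "cat"),
]

table = translation_probabilities(toy_pairs)
print(table["maison"])  # "house" gets two-thirds of the weight, "home" one-third
```

With only four pairs the estimates are crude, which is the point of the theory: more aligned text means more counts and better guesses.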
Even the hybrid efforts get a bad rap for not being “human quality,” but that standard seems to have been lowered by the Internet demands of instant communication. That the world is far from linguistically flat, however, is just reality.
That reality includes the fact that more than half the content on the Internet is in a language other than English, and that three out of four Internet users are not native speakers of English. There is a growing presence of native Chinese-speaking users, which draws attention to the challenges of CT for Chinese and English.
An immediate difference from an alphabetic system is the much larger number of characters compared with the number of letters, even though the exact number of existing Chinese characters cannot be precisely identified. Full literacy requires only between 3,000 and 4,000 characters. There are simplified and traditional characters, as well as variant characters.
Word identification poses unusual problems. In English and most other languages, a spoken word is represented in writing by a string of letters delimited on both sides by white spaces. In Chinese, however, we cannot identify words in a similar fashion, because in Chinese writing, no white spaces are left between units of written script. Therefore, before morphological processing can take place, an additional step of segmentation is necessary, by which continuous strings of characters are cut into word chunks. Then, too, there are significant structural differences between English and Chinese, such as the different orderings of head nouns and relative clauses. In English, modifiers, whether they are adjectives, groups of nouns or clauses, can come both before and after the noun.
In Chinese, the modifying elements virtually always come before the noun, the modifying component can be quite long, and there is a wide range of noun modification constructions. I’ve yet to see machine translation software correctly decide where the set of modifying elements begins. So it’s not a surprise that English-Chinese paired CT has the worst results, even poorer than for other difficult language sets.
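The segmentation step described earlier, cutting a continuous string of characters into word chunks, can be illustrated with one simple and well-known technique, forward maximum matching: at each position, take the longest dictionary word that fits. The sketch below is in Python; the function name `segment`, the `max_len` setting and the tiny lexicon are assumptions made for this example, and real segmenters are considerably more sophisticated.

```python
def segment(text, lexicon, max_len=4):
    """Greedy forward maximum matching: prefer the longest lexicon word at each position."""
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking down to a single character.
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            # A single character is always accepted so the loop cannot stall.
            if length == 1 or candidate in lexicon:
                words.append(candidate)
                i += length
                break
    return words

# Hypothetical toy lexicon: "I", "like", "machine", "translation", "machine translation".
toy_lexicon = {"我", "喜欢", "机器", "翻译", "机器翻译"}

print(segment("我喜欢机器翻译", toy_lexicon))  # ['我', '喜欢', '机器翻译']
```

Because the method is greedy, it picks “机器翻译” as one word rather than “机器” plus “翻译.” Choices like that are exactly where segmentation errors creep in before translation has even begun.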
Tom Thompson writes often on foreign language topics. He lives in Washington, DC.