Characterizing Linguistic Divergences in Machine Translation

This project is a comparative investigation of machine translation's successes and failures. The first chapter of the Slovak, German, and French editions of J.K. Rowling's Harry Potter and the Sorcerer's Stone was processed through Google Translate into the target language, English. The linguistic "divergences" among the three outputs were then examined.

English | a target language

The term source language will be used to refer to the language being translated. The term target language will be used to refer to the output language, i.e., the intended language of the translation. For the purposes of our project, processing three languages into one made more sense than processing one language into three. We wanted to judge the source language's impact on Google Translate quality, and using a single target language established a baseline for comparison.

We chose English as the target language firstly for practical reasons. All three project authors speak English fluently. Furthermore, the English language is integral to how Google Translate operates. Google Translate uses English as an intermediary language, for example, in going from Hebrew to Russian, or from Swahili to Marathi.

As for choosing our source languages, between the three of us, we had studied French, German, and Slovak. It so happened that these are three European languages in different subfamilies of the Indo-European language family. English is likewise in this language family, in the same subfamily as German. Thus all three languages are related to English, although at increasing distances. German, of course, is the closest, followed by French, then finally Slovak. This framework invited us to ponder a question: on the premise that closer linguistic relatedness implies greater structural similarity between two languages, will languages more closely related to English be translated by a machine into English more successfully?

Harry Potter | choosing a text

In this investigation, we analyzed the first chapter of the first book in J.K. Rowling's Harry Potter series, Harry Potter and the Sorcerer's Stone. This text was selected for several reasons. First of all, based on our research, it is likely that the majority of corpora used in Google Translate processing involve modern language. Furthermore, Google Translate users are likely processing modern language texts, for instance, e-mails, messages, and website excerpts. A classic text, or one more technical or academic in nature, would be written in antiquated or unnatural language. Analyzing such a text might drastically impact our results and would be less relevant to what average users expect from Google Translate as a product or tool.

Another appeal of Harry Potter is that the series has been so widely translated. This is a comparative study. Therefore, the more languages a text is available in, the broader the potential scope of our project. Moreover, Harry Potter translations are genuinely meaningful to their readerships. Because Harry Potter is read for enjoyment, people expect to read these books in their native language and do so happily, i.e., the translated copies themselves are celebrated and beloved. In contrast, literary or technical writing is of most pertinence to academics. Within these circles, there is more merit in reading such texts in their original languages, and therefore translated copies, though they have a practical purpose, have less professional substance.

Errors? | grammaticality versus mistranslation

The term "divergences" was coined to capture the nature of the elements we intended to analyze in Google translated English texts. Although the term "error" can be used to some extent in describing these elements, it does not capture the full scope of our analysis. Consider the following sentences.

Imagine Google translated the Slovak sentence 1 into the English sentence 2. Note that sentence 2 does not contain any errors in the sense that the sentence is ungrammatical in any way. However, Slovak torta means "cake," not "dessert." Although the two words are semantically related, this is a mistranslation.

Ultimately, the term "errors" suggests that in this project we simply proofread and corrected ungrammatical text. While accounting for ungrammatical structures was a component of our analysis, the term "divergence" conveys how the elements which we chose to analyze in this project may have lost either grammaticality or semantic quality (or some combination of the two) post Google Translate processing. Furthermore, it emphasizes that we are comparing how these elements manifest across different languages when these languages are translated into English.

Mistranslations | what doesn't count

Translation is more an art than a science because one-to-one grammatical and semantic correspondence does not exist between languages. In English, it is proper to say, "I am on the wrong bus." In Slovak, the word "wrong" doesn't exist in this sense, and it would be more common to use the word zlý, meaning "bad." Therefore, if Google translated "wrong bus" as zlý autobus, the semantic quality of the original phrase is retained, and this would not be considered a mistranslation, at least not within the scope of this project. This is a one-word example, but the same phenomenon can manifest at larger scales, such as the sentence level. A concept easily conveyed in one sentence in one language might require two in another. This is one reason why the number of sentences in our German, Slovak, and French texts did not correspond, a point that we needed to be mindful of in developing a schema which captured how machine translation choices diverged across the three languages.

Schema & Markup | <txt><p><s>

The three Google-translated English Harry Potter texts were put into individual XML files. Content was chunked into paragraphs and then sentences. Note that the number of paragraphs corresponded between the Slovak, German, and French versions, but again, the number of sentences did not. Each of the three authors then examined their assigned version for divergences in relation to the original source text at both the sentence and word level. The markup described below was constrained by a schema developed in RELAX NG.
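As a rough illustration of the paragraph and sentence chunking described above, the following Python sketch parses a small fragment and counts its units. The element names txt, p, and s come from the heading of this section; the sample sentences are invented for illustration and the helper name count_units is hypothetical, not part of the project's tooling.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: element names follow the section heading (<txt><p><s>);
# the sentences themselves are invented for illustration.
SAMPLE = """<txt>
  <p>
    <s>The family was proud to be normal.</s>
    <s>They did not hold with nonsense.</s>
  </p>
  <p>
    <s>The father was the director of a firm.</s>
  </p>
</txt>"""


def count_units(xml_text):
    """Return (number of paragraphs, list of sentence counts per paragraph)."""
    root = ET.fromstring(xml_text)
    paragraphs = root.findall("p")
    return len(paragraphs), [len(p.findall("s")) for p in paragraphs]


n_paragraphs, sentences_per_p = count_units(SAMPLE)
print(n_paragraphs, sentences_per_p)
```

Running the same count over each of the three files would be one way to confirm that paragraph counts line up across versions while sentence counts do not.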

Transpositions | <s tr="0"

The tag "s" marks a sentence, and "tr" stands for transposition. This term refers to a change in the relative position of an element. Note that in English, proper word or phrase order is often paramount for grammaticality and meaning. "He books read good" is ungrammatical, though perhaps its meaning is still recoverable ("He read good books"). On the other hand, "Sally tickled John" has an entirely different meaning if rendered as "John tickled Sally."

Boolean values were used to mark whether or not a sentence was transposed relative to what would have been both grammatical and semantically valid according to the source language. A value of "0" or "False" was assigned to properly ordered sentences. A value of "1" or "True" was assigned to misordered sentences.
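The Boolean tr attribute lends itself to a simple tally. The sketch below, in Python, counts how many sentences in a fragment are marked as transposed. It assumes the attribute is written as "0"/"1" or as a case-insensitive "true"/"false"; the function name transposition_counts is hypothetical, and the sample sentences reuse the examples from this section.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment reusing the example sentences from this section.
SAMPLE = """<p>
  <s tr="0">He read good books.</s>
  <s tr="1">He books read good.</s>
  <s tr="0">Sally tickled John.</s>
</p>"""

# Assumption: "1" and "true" (any capitalization) both mean "transposed".
TRUE_VALUES = {"1", "true"}


def transposition_counts(xml_text):
    """Return (number of transposed sentences, total number of sentences)."""
    root = ET.fromstring(xml_text)
    sentences = list(root.iter("s"))
    transposed = sum(
        1 for s in sentences if s.get("tr", "0").lower() in TRUE_VALUES
    )
    return transposed, len(sentences)


print(transposition_counts(SAMPLE))
```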

Divergences | <dif id="34"

Divergences were marked at the word level by the tag "dif." All difs were assigned ID numbers which corresponded with the order in which they appeared in the document. The ID number was marked with the intention that it be used in future projects, which we anticipated would involve corresponding markup of the source document. See the discussion for elaboration. Beyond this ID number, divergences were also tagged for part of speech and type, as defined below.
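Because dif IDs are meant to follow document order, a marked-up file can be checked mechanically. The following Python sketch is one possible check, assuming IDs are integers and unique; the function name ids_follow_document_order and the sample fragment are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: only the id attribute is shown on each dif.
SAMPLE = """<p>
  <s tr="0">He <dif id="1">will go</dif> last night.</s>
  <s tr="0"><dif id="2">Him</dif> cast a <dif id="3">spells</dif>.</s>
</p>"""


def ids_follow_document_order(xml_text):
    """True if dif ids are unique integers that increase in document order."""
    root = ET.fromstring(xml_text)
    # Element.iter() walks the tree in document order.
    ids = [int(d.get("id")) for d in root.iter("dif")]
    return ids == sorted(ids) and len(ids) == len(set(ids))


print(ids_follow_document_order(SAMPLE))
```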

Part of Speech | pos

Our schema accounted for 8 parts of speech. Every dif was required to be marked as only one of these 8 options.

="det"

"Determiners" are commonly referred to as articles in English. Many nouns in English require a preceding article, depending on their position in the sentence. There are definite and indefinite articles, and sometimes determiners have to accurately reflect plurality and/or spatial or mental distance.

e.g. "the cat," "a mouse," "this tomato"

="n"

Nouns refer to people, places, things, and ideas. There are common and proper nouns, for instance "city" versus "King's Cross." English nouns must be marked for singularity versus plurality.

e.g. "He and Hermione walked," "They went to Diagon Alley"

="v"

Verbs express action, e.g., "running," or state of being, e.g., "exists." There are also helping verbs that encode grammatical features, e.g., "He was walking." English verbs must reflect proper person, plurality, tense, aspect, and mood.

e.g. "Harry Potter flew on a broomstick," "He will go to Hogwarts"

="adj"

Adjectives are words which describe nouns.

e.g. "Crookshanks was fluffy"

="adv"

Adverbs are words which describe verbs, i.e., action or state of being.

e.g. "He was not there," "The three-headed dog growled viciously," "There was a cat"

="pp"

Prepositions describe a noun or pronoun's relative mental or spatial position.

e.g. "The cat sat in front of Privet Drive," "He is the headmaster of Hogwarts"

="pro"

Pronouns take the place of nouns. In English, they must express proper plurality, person, and gender.

e.g. "She summoned Dumbledore," "Who found Neville's toad?", "Her," "He wondered to himself"

="conj"

Conjunctions are connector words that link words, phrases, or sentences.

e.g. "Ron laughed but Harry didn't," "Severus Snape, who was the Potions master," "cats, toads, and owls"
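Since every dif must carry exactly one of the eight part-of-speech values listed above, a quick script can flag markup that strays from the list. This Python sketch is only an illustration of that constraint (the project itself enforced it through its RELAX NG schema); the function name invalid_pos and the sample fragment are hypothetical.

```python
import xml.etree.ElementTree as ET

# The eight pos values defined in this section.
POS_VALUES = {"det", "n", "v", "adj", "adv", "pp", "pro", "conj"}

# Hypothetical fragment; the second dif deliberately uses an unlisted value.
SAMPLE = """<s tr="0">
  <dif id="1" pos="pro">Him</dif> cast a
  <dif id="2" pos="noun">spells</dif>.
</s>"""


def invalid_pos(xml_text):
    """Return the ids of difs whose pos attribute is missing or not listed."""
    root = ET.fromstring(xml_text)
    return [
        d.get("id") for d in root.iter("dif") if d.get("pos") not in POS_VALUES
    ]


print(invalid_pos(SAMPLE))
```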

Type | type

Difs were marked for their error type. A dif could be assigned only one error type. If a dif contained more than one error, it was marked as a mistranslation. This schema accounted for 11 possible dif types. Note that in the examples below, the italicized word is a hypothetical example of the dif type discussed. Also, typing errors required that we make grammatical judgements. These grammatical judgements, in the examples below and throughout our marked-up texts, reflect our own shared American English grammars and dialects. This was simply a coding decision, and we acknowledge that different and equally valid grammars and dialects exist throughout the English language.

="tense"

Tense relays time information: past, present, future.

e.g. "He will go last night," "She was dances at the ball"

="case"

Case relays a pronoun or noun's grammatical function, for example, whether the noun or pronoun is the subject or direct object of the sentence.

e.g. "He talked to she," "Him cast a spell"

="num"

Number relays the correct plurality or singularity of a verb, noun, or pronoun.

e.g. "We goes to Hogwarts," "The cat chase the mouse."

="gen"

Pronouns must reflect the proper gender.

e.g. "Snape talked to Dumbledore because she was the smartest professor at Hogwarts"

="mst"

A mistranslation occurs when two or more other error types are present in a single dif or when semantic integrity is lost.

e.g. "He ate a cactus," "He were going now"

="moo"

Mood relays a speaker's attitude about what he is saying, for example, if he thinks it is factual or not. The conditional mood is common in English.

e.g. "He will go to the ball if he had been allowed"

="asp"

Aspect specifies how the action of a verb unfolds in time, for example, whether it is ongoing or completed. For example, within the English present tense, there is the present progressive. Within the past tense, there is the past perfect. There are many other examples.

e.g. "He spoken to him last week"

="per"

Verbs and pronouns should reflect the correct person, i.e., first, second, or third person.

e.g. "I rides the broomstick," "They goes to Hogwarts"

="del"

Deletion errors occur where elements required to make a sentence grammatical were omitted.

e.g. "Ron talked to [the] professor"

="pos"

Part of speech errors occur where elements are present with the correct meaning but aren't morphologically marked as the correct part of speech.

e.g. "Malfoy laughed cruel"

="ins"

Insertion errors occur when translation adds non sequitur elements not in the source text.

e.g. "They were proud of himself that their daughter won"
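The rule that a dif receives exactly one type, with two or more co-occurring errors collapsing to "mst," can be written out as a small function. The Python sketch below is an illustration of that coding rule under the eleven type values listed above; the function name assign_type is hypothetical and does not describe how the authors actually applied the rule, which was done by hand.

```python
# The eleven type values defined in this section.
TYPE_VALUES = {
    "tense", "case", "num", "gen", "mst", "moo",
    "asp", "per", "del", "pos", "ins",
}


def assign_type(errors):
    """Collapse the error labels observed in one dif into a single type value.

    One distinct label is kept as-is; two or more distinct labels become "mst".
    """
    unique = set(errors)
    if not unique:
        raise ValueError("a dif must contain at least one error")
    unknown = unique - TYPE_VALUES
    if unknown:
        raise ValueError(f"unrecognized type value(s): {sorted(unknown)}")
    if len(unique) >= 2:
        return "mst"
    return next(iter(unique))


print(assign_type(["gen"]))          # single error type is kept
print(assign_type(["gen", "case"]))  # two error types collapse to "mst"
```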

Two or More Errors | in defense of <mst> marking

Marking elements containing two or more "grammatical errors," for example, tense and person, as a "mistranslation" was a coding decision made for two reasons. First of all, XML potentially imposes a hierarchy on difs where there should be none. Consider a scenario where sentence 2 below is supposed to be a rendition of sentence 1.

The pronoun "she" in sentence 2 is supposed to stand for "John." The correct rendering would have been "him." Thus, this "dif" is incorrect in terms of gender and case. In one approach, "she" could be coded in a structure such as either <case><gender>she</gender></case> or <gender><case>she</case></gender>. Here, XML is ranking the two error types either in a way that is linguistically unnecessary and/or meaningless or in a way that we would have to justify ("'Case' errors should be nested within 'gender' errors because...").
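The hierarchy concern can be seen directly by parsing the two nested alternatives: an XML parser treats them as different trees even though, linguistically, they describe the same pair of errors. The Python sketch below illustrates this; the helper name nesting is hypothetical.

```python
import xml.etree.ElementTree as ET

# The two nested encodings discussed above.
case_outer = ET.fromstring("<case><gender>she</gender></case>")
gender_outer = ET.fromstring("<gender><case>she</case></gender>")


def nesting(elem):
    """Return the tag names from the outermost element to the innermost."""
    path = [elem.tag]
    while len(elem):  # len() of an Element is its number of child elements
        elem = elem[0]
        path.append(elem.tag)
    return path


print(nesting(case_outer))    # ['case', 'gender']
print(nesting(gender_outer))  # ['gender', 'case']
```

The two encodings yield different orderings, so choosing one over the other would amount to an implicit claim about which error type outranks the other.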

We might have assigned multiple type attributes within one element to instances like this. However, that raises the question: is a simultaneous "case" and "gender" error different from a simultaneous "case" and "number" error? This is something that would have required justification. Ultimately, the decision to mark difs presenting with multiple errors as "mistranslations" was made on semantic/syntactic grounds. Consider the following three sentences.

The three sentences differ only by one word: "he," versus "I," versus "she." Imagine that sentence 2 was the output produced by Google Translate, but according to the source document, the sentence output should have been rendered as sentence 1. "We" is the sole dif in the sentence, and it contains two types of errors: person and number. Within the framework of our schema implementation, this dif would be considered a "mistranslation."

Recall that "mistranslation" involves a loss of semantic integrity. Consider sentence 3, where "he" is rendered as "she," a rendering which involves just one error type: gender. In this instance, it is plausible that readers would understand that "she" should have been "he" and is actually referring back to "the boy." Importantly, it is also plausible that readers would be able to glean this information from the context of the sentence alone. That is to say, with just one error within the dif, overall sentence meaning is more salvageable solely within the scope of the sentence than it would have been had the dif contained more errors. Certainly, in going between sentences 1 and 2, the overall meaning is altered more drastically. Readers would need context beyond the confines of the sentence in order to recover the fact that "I" is not in fact referring to the speaker but to "the boy."

Ultimately, the more errors within a "dif," the worse off its semantic integrity, thus providing grounds to deem it a mistranslation. There is further justification for treating the sentence as the boundary of the context within which difs may be assessed for semantic salvageability. First of all, in our coding scheme, difs are marked within sentences, not within groups of sentences or paragraphs. Secondly, this design choice is true to how Google Translate operates: it processes text on a sentence-by-sentence basis. Marking difs with two or more errors as a mistranslation is therefore a strong decision both linguistically and technically. Regardless, it should be noted that we found very few instances of multiple errors as we were coding.

Methodology

Explanation of our data and markup scheme