This project is a comparative investigation of machine translation's
successes and shortcomings. The first chapter of the Slovak, German, and French
editions of J.K. Rowling's Harry Potter and the Sorcerer's Stone was processed
through Google Translate into the target language, English. The linguistic
"divergences" among the three outputs were then examined.
English | a target language
The term source
language will be used to refer to the language translated. The term target
language will be used to refer to the output language, i.e., the intended
language of the translation. First of all,
processing three languages into one, rather than one language into three,
made more sense for the purposes of our project. We wanted to judge the source
language's impact on Google Translate quality, and using a single target language
established a baseline for comparison. We chose English as the target language firstly for practical reasons. All
three project authors speak English fluently. Furthermore, the English language is
integral to how Google Translate operates. Google Translate uses English as an
intermediary language, for example, in going from Hebrew to Russian, or from Swahili to Marathi.
As for choosing our source languages, between the three of us, we had
studied French, German, and Slovak. It so happened that these are three European languages in different
subfamilies of the Indo-European language family. English likewise is in this language family, in the same
subfamily as German. Thus the three languages are related to English although at increasing distances.
German, of course, is the closest, followed by French, then finally Slovak. This framework invited us to ask:
if linguistic relatedness implies greater structural similarity between two languages, will
languages more closely related to English be translated into English by a machine more successfully?
Harry Potter | choosing a text
In this investigation,
we analyzed the first chapter from J.K. Rowling's first book in her Harry
Potter series, Harry Potter and the Sorcerer's Stone. This text was
selected for several reasons. First of all, based on our research, it is likely that
the majority of the corpora used in Google Translate processing involve modern
language. Furthermore, Google Translate users are likely processing modern-language
texts, for instance, e-mails, messages, and website excerpts. A text that was a
classic, or otherwise more technical or academic in nature, would be written in either
antiquated or unnatural language. Analyzing such a text might drastically affect our
results and would be less relevant to what average users expect from Google Translate as a product or tool.

Another appeal of Harry Potter is that the series has been so
widely translated. This is a comparative study. Therefore, the more
languages a text is available in, the broader the potential scope of our
project. Moreover, Harry Potter translations are genuinely meaningful
to their readerships. Because Harry Potter is read for enjoyment, people
expect to, and happily do, read these books in their native language; that is, the translated
copies themselves are celebrated and beloved. In contrast, literary or technical writing is of most pertinence to
academics. In these circles, there is more merit in reading such
texts in their original languages rather than in translation, and therefore
translated copies, though they serve a practical purpose, carry less professional weight.
Errors? | grammaticality versus mistranslation
The
term "divergences" was coined to capture the nature of the elements we intended to
analyze in Google-translated English texts. Although the term "error" can be used to
some extent in describing these elements, it does not capture the full scope of our
analysis. Consider the following sentences.
(1) Tortu žena jedla.
cake-fem.sing.accu  woman-fem.sing.nom  eat-fem.sing.past
(2) The woman ate the dessert.
Imagine Google translated the Slovak sentence 1 into the English
sentence 2. Note that sentence 2 contains no errors in the sense of
being ungrammatical in any way. However, Slovak torta means "cake," not
"dessert." Although the two words are semantically related, this is a mistranslation.
Ultimately, the term "errors" suggests that in this project we
simply proofread and corrected ungrammatical text. While accounting for ungrammatical structures
was a component of our analysis, the term "divergence" conveys how the elements that we
chose to analyze in this project may have lost either grammaticality or semantic
quality (or some combination of the two) after Google Translate processing. Furthermore,
it emphasizes that we are comparing how these elements manifest across different
languages when these languages are translated into English.
Mistranslations | what doesn't count
Translation is more an art than a science because one-to-one grammatical and
semantic correspondence does not exist between languages. In English, it is proper
to say, "I am on the wrong bus." In Slovak, the word "wrong" doesn't exist in
this sense, and it would be more common to use the word zlý meaning "bad."
Therefore, if Google translated "wrong bus" as zlý autobus, the semantic
quality of the original phrase is retained, and this would not be considered a
mistranslation, at least not within the scope of this project. This is a one-word example, but
the same phenomenon can appear at larger scales, such as the sentence level. A concept easily conveyed
in one sentence in one language might require two in another. This is one reason why the number of sentences
in our German, Slovak, and French texts did not correspond, a point that we needed to be mindful
of in developing a schema that captured how machine translation choices diverged across the three languages.
Schema & Markup | <txt><p><s>
The three Google-translated English Harry
Potter texts were put into individual XML files. Content was chunked into paragraphs and
then sentences. Note that the number of paragraphs corresponded across the Slovak,
German, and French versions, but again, the number of sentences did not. Each of the three
authors then examined their assigned version for divergences in relation to the
original source text at both the sentence and word level. The markup described
below was constrained by a schema developed in RELAX NG.
Transpositions | <s tr="0"
The
tag "s" marks a sentence, and "tr" stands for transposition. This term refers
to a change in the relative position of an element. Note that in English, proper word
or phrase order is often paramount for grammaticality and meaning. "He books read
good" is ungrammatical, though perhaps its meaning is still recoverable ("He
read good books"). On the other hand, "Sally tickled John" has a different meaning
entirely if rendered as "John tickled Sally." Boolean values were used to mark whether or not a sentence was
transposed relative to what would have been both grammatical and semantically valid according
to the source language. A value of "0" or
"False" was assigned to properly ordered sentences. A value of "1" or "True" was assigned
to misordered sentences.
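As a rough illustration of the sentence-level markup, the sketch below parses a small invented fragment with Python's standard library and counts the sentences flagged as transposed. The element and attribute names (txt, p, s, tr) follow the description above. The sample sentences and the helper name count_transposed are hypothetical, and treating "1" and "true" as the true values is an assumption based on the usual XML boolean forms, not a confirmed detail of our schema.

```python
import xml.etree.ElementTree as ET

# Invented sample, not taken from the project's files.
SAMPLE = """
<txt>
  <p>
    <s tr="0">He read good books.</s>
    <s tr="1">He books read good.</s>
  </p>
  <p>
    <s tr="false">Sally tickled John.</s>
  </p>
</txt>
"""

# Assumed spellings of a true boolean value for the tr attribute.
TRUE_VALUES = {"1", "true"}


def count_transposed(xml_text):
    """Return (transposed, total) sentence counts for one document."""
    root = ET.fromstring(xml_text)
    sentences = root.findall("./p/s")
    transposed = sum(
        1 for s in sentences
        if s.get("tr", "0").strip().lower() in TRUE_VALUES
    )
    return transposed, len(sentences)


transposed, total = count_transposed(SAMPLE)
print(f"{transposed} of {total} sentences marked as transposed")
```

Running the same count over each of the three files would give a simple per-language transposition rate for comparison.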
Divergences | <difs id="34"
Divergences were marked at the word level by the tag "dif." All difs were assigned ID
numbers corresponding to the order in which they appeared in the document.
The ID number was included with the intention that it be used in future projects, which
we anticipated would involve corresponding
markup of the source document. See the
discussion for elaboration. Beyond this ID number, divergences were also tagged for part of
speech and type, as defined below.
Part of Speech | pos
Our schema accounted for 8 parts
of speech. Every dif was required to be marked as exactly one of these 8 options.
="det"
"Determiners" are commonly referred to as articles in English.
Many nouns in English require a preceding article, depending on their position
in the sentence. There are definite and indefinite articles, and sometimes
articles have to accurately reflect plurality and/or spatial or mental distance.
e.g. "the cat," "a mouse," "this
tomato"
="n"
Nouns refer to people, places, things, and
ideas. There are common and proper nouns, for instance "city" versus "King's
Cross." English nouns must be marked for singularity versus plurality. e.g. "He and Hermione walked," "They went to Diagon
Alley"
="v"
Verbs express action, e.g., running, or
state of being, e.g., exists. There are also helping verbs that encode
grammatical features, e.g., "He was walking." English verbs must reflect
proper person, plurality, tense, aspect, and mood. e.g. "Harry
Potter flew on a broomstick," "He will go to Hogwarts"
="adj"
Adjectives are words which describe nouns.
e.g. "Crookshanks was fluffy"
="adv"
Adverbs are words which describe verbs, i.e.,
action or state of being. e.g. "He was not there," "The
three-headed dog growled viciously," "There was a cat"
="pp"
Prepositions describe a noun or pronoun's
relative mental or spatial position. e.g. "The cat sat in front
of Privet Drive," "He is the headmaster of Hogwarts"
="pro"
Pronouns take the place of nouns. In English,
they must express proper plurality, person, and gender. e.g.
"She summoned Dumbledore," "Who found Neville's toad?",
"Her," "He wondered to himself"
="conj"
Conjunctions are connector words, linking
words, phrases, or sentences. e.g. "Ron laughed but
Harry didn't," "Severus Snape, who was the Potions master," "cats, toads,
and owls"
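To illustrate the constraint that each dif carries exactly one of the eight part-of-speech values, the following sketch checks the pos attribute of every dif in an invented sentence. In the project itself this kind of constraint was enforced by the RELAX NG schema; the Python check below is only a stand-in for illustration, and the sample sentence and the helper name check_pos are hypothetical.

```python
import xml.etree.ElementTree as ET

# The 8 part-of-speech values listed above.
POS_VALUES = {"det", "n", "v", "adj", "adv", "pp", "pro", "conj"}

# Invented sentence; only the id and pos attributes are shown here.
# The third dif deliberately uses a value outside the allowed set.
SAMPLE = (
    '<s tr="0">'
    '<dif id="1" pos="det">A</dif> cat sat in front '
    '<dif id="2" pos="pp">at</dif> Privet Drive and '
    '<dif id="3" pos="verb">growl</dif>.'
    '</s>'
)


def check_pos(xml_text):
    """Return a message for each dif whose pos value is not allowed."""
    root = ET.fromstring(xml_text)
    problems = []
    for dif in root.iter("dif"):
        pos = dif.get("pos")
        if pos not in POS_VALUES:
            problems.append(f"dif {dif.get('id')}: unexpected pos {pos!r}")
    return problems


for message in check_pos(SAMPLE):
    print(message)
```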
Type | type
Difs were marked for their error type. A
dif could be assigned only one error type. If a dif contained more than one error,
it was marked as a mistranslation. This schema accounted for 11 possible dif types.
Note that in the examples below, the italicized word is a hypothetical example of
the dif type discussed. Also, assigning error types required that we make grammatical
judgements. These grammatical judgements, in the examples below and throughout our
marked-up texts, reflect our own shared American English grammars and dialects. This
was simply a coding decision, and we acknowledge that different and equally valid grammars and dialects
exist throughout the English language.
="tense"
Tense relays time information: past,
present, future. e.g. "He will go last night," "She was
dances at the ball"
="case"
Case relays a pronoun or noun's grammatical
function, for example, is the noun or pronoun the subject or direct object of the
sentence. e.g. "He talked to she," "Him cast a
spell"
="num"
Number relays the correct plurality or
singularity in a verb, noun, or pronoun. e.g. "We goes to
Hogwarts," "The cat chase the mouse."
="gen"
Pronouns must reflect the proper gender. e.g. "Snape talked to Dumbledore because she was the smartest
professor at Hogwarts"
="mst"
A mistranslation occurs when two or more other
error types are present in a single dif or when semantic integrity is lost. e.g. "He ate a cactus," "He were going
now"
="moo"
Mood relays a speaker's attitude about what he
is saying, for example, if he thinks it is factual or not. The conditional mood is
common in English. e.g. "He will go to the ball if he had been
allowed"
="asp"
Aspect specifies how the action of a verb unfolds
in time, e.g., whether it is ongoing or completed. For example, within the English present tense, there is the present
progressive. Within the past tense, there is the past perfect. There are many
other examples. e.g. "He spoken to him last
week"
="per"
Verbs and pronouns should reflect the correct
person, i.e., first, second, or third person. e.g. "I
rides the broomstick," "They goes to Hogwarts"
="del"
Deletion errors occur where elements required
to make a sentence grammatical were omitted. e.g. "Ron
talked to [the] professor"
="pos"
Part of speech errors occur where elements
are present with the correct meaning but aren't morphologically marked as the correct
part of speech. e.g. "Malfoy laughed cruel"
="ins"
Insertion errors occur when translation
adds non sequitur elements not in the source text. e.g. "They were proud of himself that
their daughter won"
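Because every dif carries exactly one of the 11 type values, the types in a marked-up file can be tallied directly. The sketch below shows one way this might be done; it is not the analysis code used in the project. The sample difs reuse the hypothetical examples above, and the pos values assigned to them, like the helper name tally_types, are assumptions made for illustration.

```python
from collections import Counter
import xml.etree.ElementTree as ET

# The 11 dif types listed above.
DIF_TYPES = {
    "tense", "case", "num", "gen", "mst", "moo",
    "asp", "per", "del", "pos", "ins",
}

# Invented sample built from the hypothetical examples in this section.
SAMPLE = """
<txt>
  <p>
    <s tr="0">He <dif id="1" pos="v" type="tense">will go</dif> last night.</s>
    <s tr="0">Malfoy laughed <dif id="2" pos="adv" type="pos">cruel</dif>.</s>
    <s tr="0"><dif id="3" pos="pro" type="case">Him</dif> cast a spell.</s>
    <s tr="0">He talked to <dif id="4" pos="pro" type="case">she</dif>.</s>
  </p>
</txt>
"""


def tally_types(xml_text):
    """Count difs by type, rejecting any value outside the 11 options."""
    root = ET.fromstring(xml_text)
    counts = Counter()
    for dif in root.iter("dif"):
        dif_type = dif.get("type")
        if dif_type not in DIF_TYPES:
            raise ValueError(f"dif {dif.get('id')}: unknown type {dif_type!r}")
        counts[dif_type] += 1
    return counts


for dif_type, count in sorted(tally_types(SAMPLE).items()):
    print(dif_type, count)
```

Applying the same tally to the Slovak, German, and French outputs would yield three type distributions that can be compared side by side.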
Two or More Errors | in defense of <mst> marking
Marking elements containing two or more "grammatical errors,"
for example, tense and person, as a "mistranslation" was a coding decision made for
two reasons. First of all, XML potentially imposes hierarchy on difs where there
shouldn't be any. Consider a scenario where sentence 2 below is supposed to be a rendition of sentence 1.
(1) He talked to John.
(2) He talked to she.
The pronoun "she" in sentence 2 is supposed to stand for "John." The correct rendering would have been "him."
Thus, this "dif" is
incorrect in terms of gender and case. In one
approach, "she" could be coded in a structure such as
<case><gender>she</gender></case> or
<gender><case>she</case></gender>. Here, XML is ranking the
two error types either in a way that is linguistically unnecessary or meaningless, or in
a way that we would have to justify ("'Case' errors should be nested within 'gender' errors because...").
We might instead have assigned multiple type values to a single element in instances like
this. However, that raises the question: is a simultaneous "case" and "gender" error
different from a simultaneous "case" and "number" error? This is something that would
have required justification. Ultimately, the decision to mark difs presenting with multiple errors as "mistranslations"
was made on semantic/syntactic grounds. Consider the following three sentences.
(1) Professor Dumbledore talked to the boy because he skipped
class.
(2) Professor Dumbledore talked to the boy because we skipped
class.
(3) Professor Dumbledore talked to the boy because she skipped
class.
The three sentences differ only by one word: "he," versus "we," versus "she." Imagine that
sentence 2 was the output produced by Google Translate, but according to the
source document, the sentence output should have been rendered as sentence 1. "We" is the
sole dif in the sentence, and it contains two types of errors: person and number. Within the
framework of our schema implementation, this dif would be considered a "mistranslation."
Recall that "mistranslation" involves a loss of semantic integrity.
Consider sentence 3, where "he" is rendered as "she," a rendering which involves just one
error type: gender. In this instance, it is
plausible that readers would understand
that "she" should have been "he" and is actually referring back to "the boy." Importantly, it is also
plausible that readers would be able to glean this information from the context of the sentence alone. That is to say,
with just one error within the dif, overall sentence meaning is more salvageable solely within the scope of the sentence than
it would have been had the dif contained more errors. Certainly, in going between sentences 1 and 2, the overall meaning
is altered more drastically. Readers
would need context beyond the confines of the sentence in order to recover the
fact that "we" is not in fact referring to a group that includes the speaker but to "the boy."
Ultimately, the more errors
within a "dif," the worse off its semantic integrity, thus providing grounds to deem it a mistranslation.
There is further justification for treating the sentence as the boundary of the context
within which a dif's semantic salvageability is assessed. First of all, in our coding scheme,
difs are being marked within sentences, not within groups of sentences or paragraphs. Secondly,
this design choice is true to how Google Translate operates. It processes text on a sentence-by-sentence basis.
Marking difs with two or more errors as mistranslations is a strong decision
both linguistically and technically. Regardless, it should be noted that we found very few instances of multiple errors as we were coding.
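The coding rule defended in this section can be summarized in a few lines. The sketch below is a restatement of the rule for illustration, not part of the project's tooling: the function name assign_type and the meaning_lost flag are hypothetical, and the judgement of whether meaning was lost was made by a human coder, not computed.

```python
def assign_type(observed_errors, meaning_lost=False):
    """Collapse a coder's observations about one dif into a single type value.

    observed_errors: grammatical error types noticed in the dif,
                     e.g. {"gen"} or {"per", "num"}.
    meaning_lost:    True when the coder judges that semantic integrity
                     was lost (e.g. "cake" rendered as "dessert").
    """
    errors = set(observed_errors)
    # Two or more error types, or any loss of meaning, count as mistranslation.
    if meaning_lost or len(errors) >= 2:
        return "mst"
    if len(errors) == 1:
        return next(iter(errors))
    raise ValueError("a dif needs at least one error or a loss of meaning")


# "she" for "he": one error type, so the specific type is kept.
print(assign_type({"gen"}))
# "we" for "he": person and number together become a mistranslation.
print(assign_type({"per", "num"}))
# "dessert" for "cake": grammatical, but meaning is lost.
print(assign_type(set(), meaning_lost=True))
```

This keeps a single flat type attribute on each dif, which avoids both the nesting problem and the need to distinguish between different combinations of errors.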