Purpose | research questions
Originally, this project was born out of our experiences as foreign language
students. Nowadays, anyone who has studied or worked with a foreign language has
probably tried their hand at Google Translate... with varying success. Between the
three of us, we had studied German, Slovak, and French, and we could say that
some of us had better experiences with Google Translate than others based on which
language we studied. Fundamentally, we wanted to develop a project which examined
how machine translation works on the basis of its output quality. More than that, we
also wanted to characterize how quality diverged amongst outputs when translating
from multiple language types into a single target language. To do this, we took the
first chapter from the German, Slovak, and French copies of Harry Potter and the
Sorcerer's Stone and processed them into English through Google Translate.
For our analysis, we came up with the following research questions.
1. In Google Translate, does the source language
have an impact on output quality? More specifically, between German, Slovak, and
French, which language would result in the poorest quality English translation
in terms of syntax, grammaticality, and semantic integrity?
2. How do syntactic, grammatical, and semantic errors manifest
between the different translations? To describe this, we specified two further
questions. How many sentences were grammatically misordered in each translation?
Which parts of speech were most ungrammatical and/or most often mistranslated in
each translation? Of these parts of speech, which grammatical features were most
affected, e.g., tense, case, gender, or person?
To answer these questions, we had to define all of our terminology in terms of our
project scope. Even the simplest of terms -- "error," "grammatical," "pronoun,"
"conjunction" -- needed to be precisely delineated. On the basis of these
definitions, we then implemented a coding scheme that would help us capture the
differences amongst the three translations and eventually perform statistical
analyses. A description of this design process may be found on our methodology page,
under "Research" > "Method" in our menu bar.
Hypothesis | original predictions
Note that English, German, Slovak, and French are in fact related languages. They are all
members of the Indo-European language family. English is most closely related to
German, then to French, and then to Slovak. On this basis, we predicted that Slovak would
produce the most errors in terms of syntax and also morphology, lexicality, and
semantic integrity across all parts of speech. Slovak differs drastically
from English in its etymology, its syntax (word order is free in Slovak), its lack of
determiners (e.g., "the," "an"), its extensive case marking system, and its tense
system, which has only three tenses. We imagined that we would see a large number of deleted
words and tense errors in the Slovak-to-English translation. On the same grounds,
we predicted that German, being in the same language subfamily as English, would
result in the highest quality translation. At the same time, we were also aware to
an extent that Google Translate processes languages with a statistical approach and
does not utilize grammar rules or semantic processing. This left room for our
predictions to surprise us.
Further Motivations | a broader scope
Primarily, this project is intended to be a formal analysis of machine translation.
Our methodology was designed based on linguistic theory, and the content of the site
itself is loosely based on the formatting of an academic paper, particularly the
"Research" section. At the same time, we did not want to assume that visitors to our
page had a background in computational linguistics or foreign language study at all!
We were aware that our choice of text, Harry Potter, together with Google
Translate itself, a hugely popular tool, was pretty much an open invitation to
non-linguists and perhaps even an audience younger than college age. This presented
us with a very exciting opportunity! To take advantage of it, this website is
intended to provide an informal introduction not only to linguistics and its
technological applications, but also to the world's languages and geography. As you navigate the website, keep an eye out for easter eggs! Our
"Background" section exists as a crash course in major concepts used throughout our
project. "Linguistics" presents a summary of each of the core aspects of the field.
"Google" gives an overview of how Google Translate and machine translation work in
general. "Languages" is a brief, Harry Potter-themed introduction to language
etymology. For newcomers: linguistics as an academic discipline
touches upon a range of issues involving language and cognition. This project, and
computational linguistics in general, is just one aspect of the field.