Characterizing Linguistic Divergences in Machine Translation

Purpose | research questions

This project was born out of our experiences as foreign language students. Nowadays, anyone who has studied or worked with a foreign language has probably tried their hand at Google Translate... with varying success. Among the three of us, we had studied German, Slovak, and French, and some of us had better experiences with Google Translate than others, depending on which language we studied. Fundamentally, we wanted to develop a project that examined how machine translation works by looking at the quality of its output. More than that, we wanted to characterize how quality diverges among outputs when translating from several types of source language into a single target language. To do this, we took the first chapter from the German, Slovak, and French editions of Harry Potter and the Sorcerer's Stone and translated each into English with Google Translate. For our analysis, we came up with the following research questions.

1. In Google Translate, does the source language have an impact on output quality? More specifically, of German, Slovak, and French, which language results in the poorest quality English translation in terms of syntax, grammaticality, and semantic integrity?

2. How do syntactic, grammatical, and semantic errors manifest across the different translations? To describe this, we specified three further questions. How many sentences were grammatically misordered in each translation? Which parts of speech were most often ungrammatical and/or mistranslated in each translation? Of these parts of speech, which grammatical features were most affected (e.g., tense, case, gender, person)?

To answer these questions, we had to define all of our terminology within the scope of our project. Even the simplest of terms ("error," "grammatical," "pronoun," "conjunction") needed to be precisely delineated. On the basis of these definitions, we then implemented a coding scheme that would help us capture the differences among the three translations and eventually perform statistical analyses. A description of this design process can be found on our methodology page, under "Research" > "Method" in the menu bar.
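For readers curious what a coding scheme of this kind might look like in practice, the following is a minimal Python sketch of how annotated errors could be recorded and tallied. It is not the scheme used in this project: the field names, category labels, helper functions, and sample records are all hypothetical, chosen only to illustrate the idea of counting errors by source language, part of speech, and grammatical feature.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class CodedError:
    """One annotated error in a translated sentence (hypothetical schema)."""
    source_language: str  # e.g. "German", "Slovak", "French"
    sentence_id: int      # index of the sentence within the chapter
    part_of_speech: str   # e.g. "verb", "pronoun", "determiner"
    feature: str          # e.g. "tense", "case", "gender", "word order"


def tally(errors, key):
    """Count errors per source language, grouped by the chosen attribute."""
    counts = {}
    for err in errors:
        counts.setdefault(err.source_language, Counter())[getattr(err, key)] += 1
    return counts


def misordered_sentences(errors):
    """Count distinct sentences per language with at least one word-order error."""
    seen = {}
    for err in errors:
        if err.feature == "word order":
            seen.setdefault(err.source_language, set()).add(err.sentence_id)
    return {lang: len(ids) for lang, ids in seen.items()}


# Placeholder annotations for illustration only; these are NOT project data.
sample = [
    CodedError("Slovak", 1, "determiner", "deletion"),
    CodedError("Slovak", 1, "verb", "tense"),
    CodedError("Slovak", 2, "verb", "word order"),
    CodedError("German", 4, "verb", "word order"),
    CodedError("German", 4, "pronoun", "word order"),
    CodedError("French", 3, "pronoun", "gender"),
]

by_pos = tally(sample, "part_of_speech")
by_feature = tally(sample, "feature")
misordered = misordered_sentences(sample)
```

Counting distinct sentences separately from individual errors matters for the first sub-question: a single misordered sentence may contain several word-order errors, and it should still count only once.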

Hypothesis | original predictions

Note that English, German, Slovak, and French are in fact related languages: they are all members of the Indo-European language family. English is most closely related to German, then to French, then to Slovak. On this basis, we predicted that Slovak would produce the most errors in syntax as well as in morphology, word choice, and semantic integrity, across all parts of speech. Slovak differs drastically from English in its etymology, its syntax (word order is free in Slovak), its lack of determiners (such as "the" and "an"), its extensive case marking system, and its tense system, which has only three tenses. We imagined that we would see a large number of deleted words and tense errors in the Slovak to English translation. On the same grounds, we predicted that German, being in the same language subfamily as English, would result in the highest quality translation. At the same time, we were aware, to an extent, that Google Translate processes languages with a statistical approach and does not use grammar rules or semantic processing. This left room for our predictions to surprise us.

Further Motivations | a broader scope

Primarily, this project is intended to be a formal analysis of machine translation. Our methodology was designed based on linguistic theory, and the content of the site, particularly the "Research" section, is loosely modeled on the format of an academic paper. At the same time, we did not want to assume that visitors to our page had any background in computational linguistics or foreign language study at all! We were aware that our choice of text, Harry Potter, and our choice of Google Translate, a hugely popular tool, were pretty much an open invitation to non-linguists and perhaps even to an audience younger than college age. This presented us with a very exciting opportunity! To take advantage of it, this website is intended to provide an informal introduction not only to linguistics and its technological applications, but also to the world's languages and geography.

As you navigate the website, keep an eye out for easter eggs! Our "Background" section serves as a crash course in the major concepts used throughout our project. "Linguistics" presents a summary of each of the core aspects of the field. "Google" gives an overview of how Google Translate, and machine translation in general, works. "Languages" is a brief, Harry Potter-themed introduction to language etymology. For newcomers: linguistics as an academic discipline touches upon a range of issues involving language and cognition. This project, like computational linguistics in general, is just one aspect of the field.
