Future research we are interested in includes extending the same analysis to additional languages. We are also interested in a longitudinal analysis. For instance, Marathi was only recently added to Google Translate's inventory of languages. Since Google Translate is expected to improve over time based on user feedback, we could analyze divergences in Marathi-to-English output at regular intervals and measure the degree of improvement.
As for improving our analysis, we could develop a schema that captures finer details. For instance, we could identify clauses and the cases in which they are transposed (do predicates appear before their subjects? Are relative clauses not properly joined to the noun they describe?). Another possibility is a more coherent strategy for handling verb phrases: should each be identified as a single phrase, or would it be better to segment them into individual words and identify errors at the word level?
Finally, although this aspect was set aside for the present project for the reasons given in the methodology section, we could develop a method for identifying multiple errors within a single word and examine how that approach would affect our results.
It would also be interesting to characterize the grammatical environments in which specific errors occur; this might shed light on what triggers them.
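One hedged way to approach this would be to tally the part-of-speech context around each annotated error. The annotation format assumed below, a list of (sentence, error token index, error type) tuples, is hypothetical and only meant to suggest the shape of such an analysis.

# Hypothetical sketch: profile the POS environment of each error type,
# again assuming spaCy's en_core_web_sm model for tagging.
from collections import Counter, defaultdict
import spacy

nlp = spacy.load("en_core_web_sm")

def environment_profile(annotations, window=1):
    """annotations: iterable of (sentence, error_token_index, error_type)."""
    profiles = defaultdict(Counter)
    for sentence, idx, error_type in annotations:
        doc = nlp(sentence)
        for offset in range(-window, window + 1):
            j = idx + offset
            if offset != 0 and 0 <= j < len(doc):
                # Count the POS tags of the tokens surrounding the error.
                profiles[error_type][doc[j].pos_] += 1
    return profiles

sample = [("He go to the market yesterday.", 1, "agreement")]
print(environment_profile(sample))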
Overall, we believe our analysis thoroughly answered our initial research questions and can serve as a sound starting point for more exploratory questions regarding machine translation and machine learning.