Researchers explore surprising behavior of machine translation


By:

Eric De Grasse
Chief Technology Officer

 

5 June 2020 (Paris, France) – Training software to accurately summarize information in documents will have a great impact in many fields, such as medicine, law, and scientific research. And if you are a member of the e-discovery ecosystem, no, you will not see this technology at LegalTech or ILTA. But law firm lawyers and in-house lawyers who have been attending computational linguistics events have seen it. “Metamind”, a prominent name in machine learning and natural-language processing that was recently acquired by Salesforce, springs to mind: its users are starting to make it routine to rely on a “machine” to analyze and paraphrase articles, research papers, and other text. There are others out there, like “Narrative Science” and “Maluuba” (acquired last year by Microsoft).

All of these use several machine-learning tricks to produce surprisingly coherent and accurate snippets of text from longer pieces, hinting at how condensing text could eventually become automated. Granted, the software is still a long way from matching a human’s ability to capture the essence of a document, and some of the summaries it produces are sloppy and incoherent. Indeed, summarizing text perfectly would require genuine intelligence, including common-sense knowledge and a mastery of language. But lawyers are realising it makes no sense to employ teams of contract translators for much of their work, or standard e-discovery translation software.

But parsing language remains one of the grand challenges of artificial intelligence and it’s a challenge with enormous commercial potential. Even limited linguistic intelligence – the ability to parse spoken or written queries, and to respond in more sophisticated and coherent ways – could transform personal computing.

Last year our EMEA e-discovery review unit participated in a beta test of such a system that learns from examples of good summaries, an approach called supervised learning, but also employs a kind of artificial attention to the text it is ingesting and outputting. This helps ensure that it doesn’t produce too many repetitive strands of text, a common problem with summarization algorithms.

The system experiments in order to generate summaries of its own using a process called “reinforcement learning”. Inspired by the way animals seem to learn, this involves providing positive feedback for actions that lead toward a particular objective. Reinforcement learning has been used to train computers to do impressive new things, like playing complex games or controlling robots, and at the end of 2016 it seemed to be on everybody’s “breakthrough technologies in 2017” list. Those working on conversational interfaces are increasingly looking to reinforcement learning as a way to improve their systems.

If you are a lawyer, would you trust a machine to summarize important documents for you? Well, did you trust predictive coding when it first hit the market? No. Despite all the hype about predictive coding, it is still a work-in-progress, as is this translation software.

In modern translation software, a computer scans many millions of translated texts to learn associations between phrases in different languages. Using these correspondences, it can then piece together translations of new strings of text. The computer doesn’t require any understanding of grammar or meaning; it just regurgitates words in whatever combination it calculates has the highest odds of being accurate. The result lacks the style and nuance of a skilled translator’s work but has considerable utility nonetheless. Although machine-learning algorithms have been around a long time, they require a vast number of examples to work reliably, which only became possible with the explosion of online data. As a Google engineer (he works in Google’s Speech Division) told me last year at DLD Tel Aviv:

When you go from 10,000 training examples to 10 billion training examples, it all starts to work. In machine learning, data trumps everything. And this is especially the case with language translation.
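As a toy illustration of that idea – learning word associations from aligned text and then picking the candidate with the highest odds – here is a minimal sketch. The corpus and the helper function are invented purely for illustration; real systems learn from many millions of sentence pairs and operate on phrases, not single words:

```python
from collections import Counter, defaultdict

# A toy parallel corpus of (source, target) sentence pairs.
parallel = [
    ("the house", "la maison"),
    ("the car", "la voiture"),
    ("a house", "une maison"),
]

# Count how often each source word co-occurs with each target word.
cooc = defaultdict(Counter)
for src, tgt in parallel:
    for s in src.split():
        for t in tgt.split():
            cooc[s][t] += 1

def best_translation(word):
    """Return the target word with the highest co-occurrence count --
    i.e., the combination "with the highest odds of being accurate"."""
    return cooc[word].most_common(1)[0][0] if cooc[word] else word

print(best_translation("house"))  # "maison" co-occurs with "house" twice
```

No grammar, no meaning – just counting. That is the point the engineer was making: with enough data, even this crude association game starts to produce usable output.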

Here in Europe our EMEA e-discovery review team and our language translation service have been using Bitext (based in Madrid), which is a deep linguistic analysis platform, or DLAP. Unlike most next-generation multi-language text processing methods, Bitext has crafted a platform. Based on our analysis, plus an analysis done by ASHA, the Bitext system delivers accuracy in the 90 to 95 percent range; most content processing systems today typically deliver metadata and rich indexing with accuracy in the 70 to 85 percent range. The company’s platform supports more than 50 languages at a lexical level and more than 20 at a syntactic level, and makes the company’s technology available for a wide range of applications including Big Data, artificial intelligence, social media analysis, and text analytics. It solves many complex language problems and integrates machine learning engines with linguistic features, including segmentation, tokenization (word segmentation), frequency analysis, and disambiguation, among others.
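Bitext’s actual pipeline is proprietary, but for readers unfamiliar with what “segmentation” and “tokenization” mean in practice, here is a deliberately simplistic stand-in (regex-based, nothing like a production DLAP):

```python
import re

def tokenize(text):
    """Split text into word and punctuation tokens -- a crude stand-in
    for the word-segmentation step of a linguistic analysis pipeline."""
    return re.findall(r"\w+|[^\w\s]", text, re.UNICODE)

def segment_sentences(text):
    """Naive sentence segmentation on terminal punctuation."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

text = "NMT changed the industry. It also raised new questions!"
for sentence in segment_sentences(text):
    print(tokenize(sentence))
```

A real platform layers lemmatization, frequency analysis, and disambiguation on top of these basic steps, which is where the accuracy differences between systems show up.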

Claims that machine translation had achieved near-human parity (however that is defined) back in 2016 were met with disbelief. People were quick to point out that the technology was still far from producing quality equivalent to that of human translators, and that the metrics themselves were flawed.

Love it or hate it, neural machine translation (NMT) became widely adopted across the language industry in the years that followed. It has fundamentally changed the supply chain and disrupted the way humans interact with translation technology, generating significant productivity gains for users.

NMT now underpins parts of the translation workflow, but relatively little is known about how the machine actually understands content or generates output, and why some of the residing quality issues persist.

Two researchers have now shone a light on some of the oddities found in NMT output, exploring unexpected behavior in RNN and Transformer NMT models. In a paper published on pre-print platform arXiv on May 25, 2020, Marzieh Fadaee and Christof Monz from the University of Amsterdam looked into “The Unreasonable Volatility of Neural Machine Translation Models.”

RNNs (Recurrent Neural Networks) are a type of artificial neural network, while Transformer is a deep machine learning model that was introduced by Google researchers in 2017. The latter is the newer and now more prevalent architecture used in machine translation and speech processing.

Fadaee was a PhD candidate at the university and has since become an NLP / ML Research Engineer at deep learning R&D lab Zeta Alpha Vector. Monz, who remains an Associate Professor at the university, describes his research interests as covering “information retrieval, document summarization and machine translation” on his LinkedIn page.

The basis for their research is that, although NMT performs well, it is not generally understood how the models behave. Examining the unexpected behavior of NMT could reveal more about its capabilities as well as shortcomings.

During their research, Fadaee and Monz observed that minor changes to the source sentences sometimes resulted in an “unexpected change in the translation,” which in some cases constituted a translation error. Since the models behaved inconsistently when confronted with similar source sentences, they are considered “volatile,” the two explained:

Important to note is that all source sentences, including modified ones, were semantically correct and plausible for the purposes of their experiments.

The researchers performed a series of tests to analyze the translations of the modified source sentences and the types of changes that occurred.

The changes the researchers made to source sentences were minor and limited to the following: removing adverbs, changing numbers (by a maximum of plus five), and inserting common words. They also changed gender pronouns, having been inspired by prior work on gender bias.

Modification – example sentence variation:

Deletion: Some 500 years after the Reformation, Rome [now\φ] has a Martin Luther Square.
Substituting a number: I’m very pleased for it to have happened at Newmarket because this is where I landed [30\31] years ago.
Insertion: I loved Amy and she is [φ\also] the only person who ever loved me.
Substituting gender: [He\She] received considerable appreciation and praise for this.
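The four modification types above are simple enough to sketch in a few lines of Python. These hypothetical helpers mimic the paper’s source-side modifications; the function names and example sentence are ours, not the authors’:

```python
# Four minimal source-side modifications: deletion, number substitution,
# insertion, and gender substitution.

def delete_word(tokens, word):
    """Deletion: drop every occurrence of a given word."""
    return [t for t in tokens if t != word]

def substitute_number(tokens, old, new):
    """Number substitution: replace one number with another."""
    return [str(new) if t == str(old) else t for t in tokens]

def insert_word(tokens, index, word):
    """Insertion: splice a common word in at a given position."""
    return tokens[:index] + [word] + tokens[index:]

def substitute_gender(tokens):
    """Gender substitution: swap he/she pronouns."""
    swap = {"He": "She", "She": "He", "he": "she", "she": "he"}
    return [swap.get(t, t) for t in tokens]

src = "I landed at Newmarket 30 years ago".split()
# Five number variants (+1 through +5), as in the paper's number test.
variants = [substitute_number(src, 30, 30 + d) for d in range(1, 6)]
print(variants[0])  # the first variant replaces "30" with "31"
```

Each variant stays semantically correct and plausible, which is exactly why consistent models should translate them near-identically.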

One test applied only to changes to numbers in source sentences. For this category of change, it was possible to have multiple variations of the original source sentence (e.g., +1, +2, +3, +4 and +5). Logically, the translations of the changed sentences should only differ to account for the change in number, but researchers found examples of “unexpectedly large oscillations” for both models.

They also looked at deviations from the original translation and classified them as major or minor deviations. The results showed major differences in 18% of RNN translations and 13% of Transformer translations.

Most of the deviations (ca. 70%) were “as expected,” meaning that they were justified by the change to the original source sentence, while unexpected changes included different verb tenses, reordered phrases, paraphrasing, preposition changes, and more. “The vast majority of changes are due to paraphrasing and dropping of words,” the researchers found. Unexpected changes did not necessarily impact translation quality.
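A crude way to quantify such deviations – this is our own token-overlap proxy, not the paper’s annotation scheme – is to compare the original translation against the variant’s translation:

```python
import difflib

def deviation_ratio(original, variant):
    """Fraction of token-level material that differs between two
    translations -- a rough proxy for how strongly a minor source
    change perturbed the output."""
    sm = difflib.SequenceMatcher(a=original.split(), b=variant.split())
    return 1.0 - sm.ratio()

base = "Rome now has a Martin Luther Square"
minor = "Rome has a Martin Luther Square"  # only "now" dropped, as expected
major = "There is a square named after Martin Luther in Rome"  # reordered, paraphrased

print(f"minor: {deviation_ratio(base, minor):.2f}  "
      f"major: {deviation_ratio(base, major):.2f}")
```

A metric like this flags the size of a change but not whether it is justified, which is why the researchers still needed human annotators for the quality judgment.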

Translation quality was tested separately through a manual evaluation by human annotators. Overall, 26% of changes observed for the RNN model impacted translation quality, compared to 19% of those observed for the Transformer model.

In conclusion, the researchers said, “even with trivial linguistic modifications of source sentences, we can effectively identify a surprising number of cases where the translations of extremely similar sentences are surprisingly different.” This means that NMT models are vulnerable to the slightest change in the source sentence, which points to two other potential shortcomings: generalization and compositionality.

Generalization refers to an MT system being able to translate long source sentences that it has not previously encountered. Compositionality is where an MT system combines multiple, simple sentence parts to build a longer, more complex string.

In their view, “the volatile behavior of the MT systems in this paper is a side effect of the current models not being compositional” because the systems clearly do not demonstrate a good understanding of the underlying sentence parts — if they did, they would not generate the inconsistencies observed.

Moreover, Fadaee and Monz said, while NMT models are capable of generalization, they do so without compositionality. As such, the researchers argued that NMT models “lack robustness” and hoped that their “insights will be useful for developing more robust NMT models.”
