Just like his millions of friends on Facebook, Meta founder and CEO Mark Zuckerberg takes to the social network to announce important news. This time he used it to launch Meta’s “No Language Left Behind” (NLLB) project.
BY: Eric De Grasse, Chief Technology Officer, PROJECT COUNSEL MEDIA
Specifically, Meta AI tweeted that the company had built an AI model capable of translating between 200 languages, for a total of 40,000 different translation directions. Zuckerberg wrote:
“To give a sense of the scale, the 200-language model has over 50 billion parameters. The advances here will enable more than 25 billion translations every day across our apps.”
According to a July 6, 2022, LinkedIn post by Meta AI, the modeling techniques from this work have already been applied to improve translations on Facebook, Instagram, and Wikipedia. In addition, several large law firms and accounting firms will join a beta team to test the technology on document data silos.
A Meta AI blog post implies that the company aims to integrate translation tools developed as part of NLLB into the metaverse, noting that “the ability to build technologies that work well in hundreds or even thousands of languages will truly help to democratize access to new, immersive experiences in virtual worlds.”
While the paper does not include a list of languages addressed in the project, a quick scan of the NLLB page on GitHub shows it mentions Asturian, Luganda, and Urdu as examples of low-resource languages. The authors, some of whom are associated with UC Berkeley and Johns Hopkins University in addition to Meta AI, noted that the degree of standardization varied across the languages studied: what appears to be a single language may contend with competing standards for script, spelling, and other conventions.
As we noted last year, researchers have weighed the potential risks and benefits of the new NLLB tools for low-resource language communities. They considered the impact on education especially promising, but wondered whether increasing the visibility of certain groups online might make them more vulnerable to increased censorship and surveillance, or exacerbate digital inequities within those groups.
In preparation for the Meta project, researchers interviewed native speakers to better understand the need for low-resource language translation support. They then created NLLB-Seed, a dataset of human-translated bitext for 43 languages, to level the playing field for low-resource languages.
The team used a novel bitext mining method to create hundreds of millions of aligned training sentences for low-resource languages. The process entailed lifting monolingual data from the web and determining whether any two given sentences could be translations of each other.
Researchers then calculated the “distance” between the sentences in a multilingual representation space using LASER 3, which researcher Angela Fan singled out as a major contribution to improved translation of low-resource languages. Starting with a more general model, LASER, researchers can specialize the representation space to extend to a new language with very little data.
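To make that mining step concrete, here is a minimal sketch of embedding-based bitext mining. To be clear, this is not Meta’s pipeline: the publicly available LaBSE encoder stands in for LASER 3, the corpora are toy sentence lists rather than web crawls, and the raw cosine threshold is a simplification of the margin-based score NLLB actually uses.

```python
# Minimal sketch of embedding-based bitext mining (NOT Meta's pipeline).
# LaBSE stands in for LASER 3; corpora and threshold are placeholders.
import numpy as np
from sentence_transformers import SentenceTransformer

# Two tiny monolingual "corpora" (in practice: billions of web sentences).
english = [
    "The cat sleeps on the sofa.",
    "Parliament votes on the new budget today.",
    "How do I reset my password?",
]
spanish = [
    "¿Cómo restablezco mi contraseña?",
    "El gato duerme en el sofá.",
    "Mañana lloverá en la costa.",
]

model = SentenceTransformer("sentence-transformers/LaBSE")

# Embed both sides into the shared multilingual space and L2-normalize,
# so a dot product equals cosine similarity (the "distance" above).
src = model.encode(english, convert_to_numpy=True)
tgt = model.encode(spanish, convert_to_numpy=True)
src /= np.linalg.norm(src, axis=1, keepdims=True)
tgt /= np.linalg.norm(tgt, axis=1, keepdims=True)

sim = src @ tgt.T  # pairwise cosine similarities

# Keep each English sentence's best Spanish candidate if it clears a
# threshold. NLLB uses a more robust margin-based score instead.
THRESHOLD = 0.7  # placeholder value
for i, row in enumerate(sim):
    j = int(row.argmax())
    if row[j] >= THRESHOLD:
        print(f"{row[j]:.2f}  {english[i]}  <->  {spanish[j]}")
```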
They also employed modeling techniques designed to significantly improve low-resource multilingual translation by reducing overfitting.
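The post does not name those techniques, but two standard regularizers in this setting are dropout and label smoothing. The PyTorch fragment below illustrates both generically; it is not NLLB code, and every value in it is a placeholder rather than Meta’s setting.

```python
# Generic illustration of two common regularizers for low-resource
# translation training: dropout and label smoothing. NOT NLLB code;
# all dimensions and rates are placeholder defaults.
import torch
import torch.nn as nn

VOCAB_SIZE = 32000  # placeholder vocabulary size

# Dropout randomly zeroes activations during training, so the model
# cannot simply memorize a small training corpus.
decoder_layer = nn.TransformerDecoderLayer(d_model=512, nhead=8, dropout=0.3)
proj = nn.Linear(512, VOCAB_SIZE)

# Label smoothing moves a little probability mass off the gold token,
# discouraging overconfident memorization of noisy mined bitext.
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)

# Toy forward/backward pass with random tensors.
tgt = torch.randn(10, 16, 512)     # (target_len, batch, d_model)
memory = torch.randn(12, 16, 512)  # stand-in for encoder output
out = decoder_layer(tgt, memory)   # dropout is applied inside this call
logits = proj(out[-1])             # next-token scores per sentence
targets = torch.randint(0, VOCAB_SIZE, (16,))
loss = criterion(logits, targets)
loss.backward()
print(f"regularized training loss: {loss.item():.3f}")
```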
NLLB introduced another innovation: FLORES-200, a high-quality, human-translated evaluation dataset. Fan explained that previous state-of-the-art (SOTA) models had only been evaluated on 101 languages using FLORES-101, a many-to-many evaluation dataset from 2021.
The authors reported that their model achieved a 44% improvement in BLEU, thus “laying important groundwork towards realizing a universal translation system.” One author noted there will be an expansion into the FIGS languages (French, Italian, German, and Spanish), though that work was already underway.
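For readers unfamiliar with BLEU: the metric scores system output against human reference translations, which is exactly what FLORES-200 supplies for all 200 languages. Here is a minimal sketch using the sacreBLEU library, with toy sentences standing in for FLORES data:

```python
# Minimal BLEU scoring sketch with sacreBLEU. The toy sentences below
# stand in for FLORES-200 data; the mechanics are the same.
import sacrebleu

hypotheses = [
    "The cat sleeps on the sofa.",
    "Parliament votes on the budget today.",
]
# One reference "stream": a human translation for each hypothesis.
references = [[
    "The cat is sleeping on the sofa.",
    "Parliament is voting on the budget today.",
]]

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU = {bleu.score:.1f}")  # corpus-level score on a 0-100 scale
```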