Research Suggests A Large Proportion Of Web Material In Languages Other Than English Is Machine Translations Of Poor Quality Texts

from the curse-of-recursion dept

The latest generative AI tools are certainly impressive, but they bring with them a wide range of complex problems, as numerous posts on Techdirt attest. A new academic paper, published on arXiv, raises more of them, but from a new angle. Entitled “A Shocking Amount of the Web is Machine Translated: Insights from Multi-Way Parallelism”, it studies the impact of today’s low-cost AI translation tools on the online world:

We explore the effects that the long-term availability of low cost Machine Translation (MT) has had on the web. We show that content on the web is often translated into many languages, and the quality of these multi-way translations indicates they were primarily created using MT.

“Multi-way” in this context means that the same sentence can be found translated into several different languages. According to the researchers, of the 6.38 billion sentences studied, 2.19 billion appear in multi-way translations. In particular, languages that appear less frequently online had more multi-way sentences, with disproportionately more found among the rarest languages. Another key observation is that highly multi-way parallel translations are “significantly worse” than two-way translations. Moreover, the multi-way data consists of shorter, more predictable sentences than the two-way translations. Inspecting a random sample of 100 highly multi-way parallel sentences, the researchers found:

the vast majority came from articles that we characterized as low quality, requiring little or no expertise or advance effort to create, on topics like being taken more seriously at work, being careful about your choices, six tips for new boat owners, deciding to be happy, etc. Furthermore, we were unable to find any translationese or other errors that would suggest the articles were being translated into English (either by human translators or MT), suggesting it is instead being generated in English and translated to other languages.

Taking these observations together, the paper suggests that highly multi-way sentences are generated using AI, specifically machine translations of low-quality English-language originals. Further analysis showed that in the languages found less commonly online, most translations are multi-way parallel, which means that AI content dominates translated material in those languages. In addition:

a large fraction of the total sentences in lower resource languages have at least one translation implying that a large fraction of the total web in those languages is MT generated

In other words, however bad the problems AI is creating for English-language material, they are probably worse in languages found less commonly online, since a large proportion of the Web in those languages is generated by machines, not humans.
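The multi-way parallelism measurements described above can be sketched in a few lines of Python. This is a toy illustration, not the paper's actual pipeline: the corpus format (a list of "translation tuples" mapping language codes to sentences), the function names, and the three-language threshold are all invented for the example.

```python
# Toy sketch of measuring multi-way parallelism in an aligned corpus.
# Each "translation tuple" maps language code -> sentence text.
from collections import Counter

corpus = [
    {"en": "Six tips for new boat owners.",
     "de": "Sechs Tipps für neue Bootsbesitzer.",
     "fr": "Six conseils pour les nouveaux propriétaires de bateaux."},
    {"en": "The weather is nice today.",
     "de": "Das Wetter ist heute schön."},
]

def parallelism_histogram(tuples):
    """Count translation tuples by how many languages each covers."""
    return Counter(len(t) for t in tuples)

def multiway_fraction(tuples, threshold=3):
    """Fraction of all sentences that sit in tuples covering
    at least `threshold` languages (i.e. multi-way parallel data)."""
    total = sum(len(t) for t in tuples)
    multi = sum(len(t) for t in tuples if len(t) >= threshold)
    return multi / total if total else 0.0

print(parallelism_histogram(corpus))
print(multiway_fraction(corpus))
```

On this two-tuple toy corpus, three of the five sentences belong to the three-language tuple, so the multi-way fraction comes out at 0.6; the paper's figure of 2.19 billion out of 6.38 billion sentences is the same kind of ratio computed at web scale.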

If this conclusion holds true beyond the dataset studied by the researchers, there is another interesting issue. Generative AI depends on large training sets, which often come from the Web. For languages other than English, the new paper suggests that much of the training material will be translations by AI of low-quality, possibly AI-generated texts. This issue of generative AI feeding on itself has been studied in earlier research. One group summarized their results on “The Curse of Recursion” as follows:

We find that use of model-generated content in training causes irreversible defects in the resulting models, where tails of the original content distribution disappear. We refer to this effect as Model Collapse and show that it can occur in Variational Autoencoders, Gaussian Mixture Models and LLMs [Large Language Models]. We build theoretical intuition behind the phenomenon and portray its ubiquity amongst all learned generative models. We demonstrate that it has to be taken seriously if we are to sustain the benefits of training from large-scale data scraped from the web.

The new research suggests this is likely to be a more serious problem when building generative AI systems in languages for which there is less online material available for training. The good news is that the presence of multi-way sentences in languages other than English is a strong indication they were produced by AI, which offers a means to spot and filter them out. The bad news is that if this technique is applied to improve the quality of training materials and avoid “model collapse”, the already energy-hungry process of training generative AI systems will become even more damaging for the planet.
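A filtering pass built on that signal might look something like the following minimal sketch. The item format, the `n_languages` field, the threshold of 8, and the function name are all hypothetical choices for illustration; the paper itself does not prescribe this implementation.

```python
# Hypothetical sketch: dropping highly multi-way parallel content from
# training data, on the heuristic that such content is likely machine
# translated. Each item records how many languages its sentence was
# found translated into.

training_items = [
    {"text": "Deciding to be happy.", "n_languages": 12},
    {"text": "The committee met on Tuesday to review the budget.", "n_languages": 2},
    {"text": "Six tips for new boat owners.", "n_languages": 9},
]

def drop_likely_mt(items, max_languages=8):
    """Keep only items whose sentence appears in at most max_languages
    languages; anything more parallel is treated as probable MT."""
    return [it for it in items if it["n_languages"] <= max_languages]

kept = drop_likely_mt(training_items)
print([it["text"] for it in kept])
```

The extra energy cost the article worries about would come not from a cheap filter like this, but from the extra crawling, alignment, and re-training needed to replace the material it throws away.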

Follow me @glynmoody on Mastodon and on Bluesky.



Comments on “Research Suggests A Large Proportion Of Web Material In Languages Other Than English Is Machine Translations Of Poor Quality Texts”

bhull242 (profile) says:

Re: Re:

I know this is almost certainly a joke, but I know someone who said this unironically. How someone could think that someone from early-1st-century Judea would be speaking any form of English while he was on Earth (when even Old English didn’t exist until at least the mid-5th century), let alone Modern English, which didn’t exist until the late 17th century and which is almost nothing like Old English, to the point that they are mutually unintelligible and use very different writing systems, is beyond me. (Old English was originally written in runes (using the futhorc, a rune set derived from the Germanic 24-character elder futhark with 5+ additional runes) before later using a version of the Latin alphabet with some additions like æ, œ, ð, þ, Ƿ, Ȝ, and ⁊, while missing “u” (“v” served that purpose), “w” (both “v” and “Ƿ” did that), and “j” (“i” was used instead), and only using “k”, “q”, or “z” for loanwords. They also used ᵹ (insular G) instead of “g”, and they had a long s (ſ) and an insular s (which looks like ɼ) instead of just “s”. Additionally, their e’s, f’s, and r’s also looked very different, and diacritics were used a lot more often.)

bhull242 (profile) says:

Re:

You make it sound so easy, but it really, really isn’t. The majority of the Earth’s population is monolingual because learning more than one language is hard. It’s a lot easier when you’re a toddler, to the point they could legitimately have two native languages, but people rarely learn new languages at that age because few families are bilingual.

And that’s just learning any language. English in particular can be unusually difficult to learn (though it’s not as hard as Icelandic or ideographic languages). Both our grammar and our vocabulary are a weird mishmash of Germanic and Romance languages, and they also have some weird quirks essentially unique to English. Our pronunciation and spelling are even weirder, especially since the Great Vowel Shift made the way we pronounce vowels drastically different from pretty much any other language that primarily uses the Latin alphabet, and since we have so many loanwords which have spellings almost unchanged from their original languages but drastically different pronunciations, or vice versa. (Side note: how did we get our current pronunciation of “karaoke”? The original Japanese is pronounced basically how it has been romanized (“kah•rah•oh•kay”), and no other English word pronounces an “a” with an “ee” sound, so what gives? I get why we changed the first “a” to make an “ay” or “air” sound and the “e” to make an “ee” sound, but the second “a”’s sound makes no sense at all.) There’s actually a joke that “fish” can be spelled “ghoti” without changing the pronunciation (“gh” from words like “enough”, “o” from “women”, and “ti” from the suffix “–⁠tion”), and “ough” is pronounced incredibly inconsistently («ou» in “drought”, «ə» in “thoroughly”, «o͞o» in “through”, «ō» in “though”, «ŏ» in “thought”, «ŏf» in “cough”, «əf» in “Greenough”, «ŭf» in “enough”, «ŭp» in “hiccough” (now more commonly spelled “hiccup”), and «ŏk» in “hough” (now more commonly spelled “hock”), with “slough” being pronounced «slŭf», «slou», or «slo͞o» depending on the meaning).

This (mostly) ultimately comes from Britain (where English originates) having been conquered by the Norse and the French early on, which mixed the original version of the language with numerous Old Norse, French, and Latin words and rules pretty early in the history of Old and Middle English. There’s also the fact that, unlike many other languages, there is no “English Authority” or whatever that has official say on what is or isn’t “correct” English. Germany, France, Spain, Japan, China, and several other countries do have regulations or regulatory bodies that dictate one or all of their respective languages’ official vocabulary, orthography (alphabet, abjads (think Arabic), logograms (think Ancient Egyptian hieroglyphs, Chinese characters, or Kanji), syllabaries (think the Japanese katakana and hiragana), pictograms, numerals, ligatures, etc.), spelling(s)/writing, punctuation, reading/writing order, capitalization (where applicable), pronunciation(s), grammar, etc. In fact, until Shakespeare’s day, there were essentially no standardized rules for English, and Webster was one of the first to see any success in standardizing English spellings. This meant a lot of changes to the language that made it more needlessly complicated.

Honestly, the one thing that makes English less complicated than most languages is the lack of grammatical gender in the vast majority of words, and those that have grammatical gender are almost entirely nouns, primarily third-person pronouns, nouns referring to humans as such (“man”/“men”, “woman”/“women”, “boy(s)”, “girl(s)”, “gentleman”/“gentlemen”, “lady”/“ladies”, “guy(s)” (though this is also often used as if gender-neutral), “gal(s)”, “dude(s)” (also sometimes gender-neutral), “dudette(s)”, etc.), words for family members/significant others (“father”, “mother”, “dad”, “mom”, “daddy”, “mommy”, “pa”, “ma”, “papa”, “mama”, “pappy”, “mammy”, “son”, “daughter”, “uncle”, “aunt”, “nephew”, “niece”, “grandfather”, “grandmother”, “grandpa”, “grandma”, “grandson”, “granddaughter”, “boyfriend”, “girlfriend”, “fiancé”, “fiancée”, “groom”, “bride”, “husband”, “wife”, etc.), several occupations (mostly ending in “–man”, “–men”, “–woman”, or “–women”, though there are some others, like “waitress”), titles of nobility/royalty, some prefixed titles (“Mr.”, “Mrs.”, “Ms.”, “Miss”, “Sir”, or “Lady”, but not “Dr.” or “Professor”), and some animals (mostly livestock or hunting targets, like “ram” vs “ewe”, “bull” vs “cow”, “rooster” vs “hen”, “buck” vs “doe”, etc.); in all of those cases, the grammatical gender virtually always refers to actual organisms with actual genders/sexes (use of “she” to refer to vessels and such notwithstanding). German has three grammatical genders (masculine, feminine, and neuter), and French and Latin (along with most Romance languages) both have two grammatical genders (masculine and feminine) present in every single noun and most pronouns, which is reflected in other words (mostly articles and adjectives, but sometimes verbs are also affected), so it’s kinda surprising how little grammatical gender there is in English.

ECA (profile) says:

Translating to English

Is horrible.
Our language is so convoluted: we have words that are spelled the same, sound different, and have 200+ meanings.
Then we have words that are spelled differently but sound the same.

Now TRY to be a human and translate. Our teachers told us to SPELL IT AS IT SOUNDS. I laugh.
NOW have a computer READ TEXT, translate it to another language, and keep the SAME meaning, when the other language has its own structure, and words with many meanings and many pronunciations.

It’s just cruel.

LostInLoDOS (profile) says:

Oops 🙊

The reality is that, outside of the basic language pairs everyone in international exchange learns, most languages only have one or two common pairings.
So you wind up with, for example, English->German->Russian->Mongolian->Mao.
The computer knows the most common pairings for each language, but doesn’t jump the fence to get there.

There are hundreds of variations of Chinese within the country, each with slight variations in script.
There are 7 Japanese forms and dozens of local dialects with their own modifications.

Even in the US we have regional words not used elsewhere. Take “youz” as an example.

How many are shocked by the word “fanny”? It means bum in the US and India, but means something quite different in other countries.

Machine translation has come a very long way. But has much further to go.
