December 14, 2025

Why doesn’t AI speak all languages the same? The linguistic gap hidden by algorithms

When we use artificial intelligence to translate a text, answer a question or write an email, we tend to imagine that it works the same in any language. The idea is logical: if it is “intelligent”, it should handle all languages with the same ease. However, the reality is very different. The models do not perform the same in English as in Spanish, nor in Spanish as in Basque. Why? Is it an inevitable technological limitation or a reflection of deeper inequalities in the digital world?

To understand it, you have to look at the basis of these technologies: data. Language models such as ChatGPT are trained on immense amounts of text, both original and created by the people who train them. But here the first great asymmetry appears: most of the written content on the network is in English. It’s not a preference of the model; it is simply what is available.

Training languages

OpenAI, the company behind ChatGPT, and other companies do not publish exact percentages for the weight of each language in training, nor can the models calculate them from the data they handle. Even so, the trend is evident: English dominates by far, followed by large global languages such as Spanish, French or German. At quite a distance, we find languages with a limited digital presence, such as Catalan or Welsh. And at an even greater distance, minority languages whose textual trace on the Internet is scarce or almost non-existent.

With this distribution, the result is predictable: the models work better in languages with more data. It is not about affinity, but about learning opportunity. When a model sees millions of examples in English, it learns the grammar, vocabulary, different registers and cultural background of that language better. On the other hand, when it receives few examples in a language, it has less material from which to deduce reliable patterns.

This explains why, in some languages, especially English, artificial intelligence seems more precise and natural, while in others it makes mistakes: agreement errors, expressions that sound “translated”, rigid constructions or a style that is too neutral or unfamiliar. The lack of data also affects writing systems: languages that use the Latin alphabet tend to be better covered than those with scripts that are less widespread digitally, such as Arabic script or indigenous alphabets, where the scarcity of examples generates more errors.

Can this gap be reduced?

Fortunately, modern AI does not simply reproduce this inequality passively. There are numerous strategies designed to mitigate the lack of data in low-resource languages. One of the most important is corpus balancing, that is, adjusting the mix of texts the model learns from. So, even if English is thousands of times more abundant, during training you can increase the frequency with which the model sees minority languages and reduce its exposure to English. It is a way to prevent minority languages from being buried.
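
One common way to do this kind of rebalancing is exponent-smoothed sampling, where each language is drawn with probability proportional to its share of the corpus raised to a power below 1. Nothing here indicates which scheme any particular company uses, so the sketch below is only an illustration: the function name, the exponent `alpha` and the token counts are all hypothetical.

```python
def sampling_weights(token_counts, alpha=0.3):
    """Return per-language sampling probabilities.

    alpha=1.0 reproduces the raw corpus proportions; smaller values
    flatten the distribution so scarce languages are sampled more often.
    """
    total = sum(token_counts.values())
    scaled = {lang: (count / total) ** alpha for lang, count in token_counts.items()}
    norm = sum(scaled.values())
    return {lang: value / norm for lang, value in scaled.items()}


# Made-up token counts, for illustration only.
counts = {"en": 1_000_000, "es": 100_000, "eu": 1_000}
raw = sampling_weights(counts, alpha=1.0)
balanced = sampling_weights(counts, alpha=0.3)

for lang in counts:
    print(f"{lang}: raw={raw[lang]:.4f} balanced={balanced[lang]:.4f}")
```

With a lower `alpha`, the share given to the scarce language rises and the share given to English falls, which is the trade-off described above.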

Another key technique is multilingual transfer. Models do not learn each language separately: they share internal representations. If the model learns Spanish, part of that knowledge carries over to Portuguese or Italian. In the same way, German reinforces Dutch. This transfer helps languages with little data as long as they belong to a language family with more abundant relatives. On the other hand, more isolated languages, such as Japanese or Korean, benefit less from this process.
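
Part of the reason transfer favours related languages is that their words share spelling fragments a model can reuse. The toy sketch below measures only that surface overlap, using character trigrams. It rests on loud assumptions (hand-picked word lists, hypothetical helper names) and is not how real models achieve transfer, which happens in learned internal representations.

```python
def char_trigrams(words):
    """Collect the set of character trigrams in a list of words."""
    grams = set()
    for word in words:
        padded = f"_{word}_"  # mark word boundaries
        for i in range(len(padded) - 2):
            grams.add(padded[i:i + 3])
    return grams


def trigram_overlap(words_a, words_b):
    """Jaccard similarity between the trigram sets of two word lists."""
    a, b = char_trigrams(words_a), char_trigrams(words_b)
    return len(a & b) / len(a | b)


# Illustrative word lists: "university", "important", "family".
spanish = ["universidad", "importante", "familia"]
portuguese = ["universidade", "importante", "família"]
japanese = ["大学", "重要", "家族"]

print("es-pt:", round(trigram_overlap(spanish, portuguese), 2))
print("es-ja:", round(trigram_overlap(spanish, japanese), 2))
```

The Spanish and Portuguese lists share many fragments, while the Spanish and Japanese lists share none, partly because they use different scripts.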

Teaching languages to AI

Synthetic data are also generated through machine translation, and multilingual parallel corpora, such as documents from international organizations or versions of Wikipedia, are used to learn equivalences between languages. In later stages, native-speaker human instructors intervene, correcting inappropriate expressions, reinforcing the appropriate tone and refining cultural details that big data does not capture.

Finally, there are specific techniques to avoid what is called “catastrophic forgetting”: when the model keeps training on data in a dominant language, it can inadvertently start to degrade what it knew in minority languages. Here, regularization and continual learning methods help maintain a certain balance.
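
One simple form of regularization against forgetting is to penalize the model for drifting away from the weights it had before the new training round. The sketch below shows that idea with plain Python lists. The function name, the `strength` parameter and the numbers are hypothetical, and real systems may use more elaborate continual learning methods.

```python
def anchored_loss(task_loss, weights, anchor_weights, strength=0.5):
    """Add an L2 penalty for moving away from previously learned weights.

    task_loss: loss on the new (dominant-language) data.
    anchor_weights: weights saved before the new training round.
    strength: how strongly drift is discouraged (assumed value).
    """
    drift = sum((w - a) ** 2 for w, a in zip(weights, anchor_weights))
    return task_loss + strength * drift


anchor = [1.0, 1.0]
unchanged = anchored_loss(1.0, [1.0, 1.0], anchor)  # weights did not move
drifted = anchored_loss(1.0, [1.0, 2.0], anchor)    # one weight moved by 1.0

print(unchanged, drifted)
```

The same task loss costs more once the weights have moved, so the optimizer is nudged to keep what it learned earlier.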

What happens with linguistic diversity?

Even so, no technical resource can fully compensate for a language having little data and little renewal of its content, so English remains the predominant language and, therefore, the gap persists.

This raises an important question: can artificial intelligence contribute to the loss of linguistic diversity? It’s a real risk. If it works better in English, some people may prefer to use it in that language. If the generated texts tend to have a homogeneous style, they can influence institutional, academic or media writing and thus displace local registers. And if a language barely appears on the Internet, it may be left out of the technological tools that increasingly shape our communication.

Revitalize minority languages

There is also the opposite potential: AI can revitalize minority languages. It can generate educational materials, help document vocabulary, serve as an interlocutor in learning processes or support digitalization projects. With political and cultural will, technology can be an ally.

The uneven performance of AI across languages is not just a technical issue: it is a mirror of real-world inequalities. The point is not to ask whether AI speaks some languages better than others, since the answer is clear: yes, it does. The question is how we can build a future in which technology does not reproduce, but reduces, linguistic gaps.


