Generative AI is trained on just a few of the world’s 7,000 languages. Here’s why that’s a problem – and what’s being done about it

Oct 6, 2025

Generative AI is mainly trained on the English language, leading to bias and, in some cases, errors with serious consequences. Image: Unsplash/Solen Feyissa

Madeleine North

Senior Writer, Forum Stories

This article has been updated.

Generative AI is mainly trained on the English language, leading to bias and, in some cases, errors with serious consequences.
Companies and governments are taking action and creating their own AI models to ensure more of the world’s 7,000 languages are embedded in the technology.
Preserving cultural heritage is one of the suggested actions put forward in the World Economic Forum’s Presidio Recommendations on Responsible Generative AI.

"Ka pai te AI Whakaputanga i ngā reo?"

According to ChatGPT – and hopefully anyone Māori – the above sentence means, “Is Generative AI good at languages?”

The answer is yes and no.

With the majority of large language models (LLMs) trained on English text, if you are, say, a student in Odisha, India, using AI to analyze a research paper in your native Odia language, the likes of ChatGPT, Claude and Perplexity may let you down.

This may have serious consequences in some cases. A translator in the US told Reuters Context that four in ten of their Afghan asylum cases were derailed in 2023 by inaccurate AI-driven translation apps. Recent research from Stanford University, meanwhile, identified systemic mistranslations occurring regularly in the US legal system.

So what is going on here? There are over 7,000 languages spoken in the world, yet most AI chatbots are trained on around 100 of them. And English, despite being spoken by less than 20% of the world’s population, is used by almost half of websites (as the chart below shows) and is the main driver of LLMs, says the Center for Democracy & Technology (CDT).
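To put those figures in proportion, here is a minimal sketch of the arithmetic in Python. It assumes the article's rounded numbers (about 100 languages out of roughly 7,000), so the result is an approximation; the function name `coverage_share` is purely illustrative.

```python
# Figures quoted in the article; both are approximations.
TOTAL_LANGUAGES = 7_000   # "over 7,000 languages spoken in the world"
COVERED_LANGUAGES = 100   # "around 100" used to train most AI chatbots


def coverage_share(covered: int, total: int) -> float:
    """Return the fraction of languages covered, between 0 and 1."""
    if total <= 0:
        raise ValueError("total must be positive")
    return covered / total


share = coverage_share(COVERED_LANGUAGES, TOTAL_LANGUAGES)
print(f"Covered: {share:.1%}")          # Covered: 1.4%
print(f"Not covered: {1 - share:.1%}")  # Not covered: 98.6%
```

On these rounded inputs, most chatbots are trained on roughly one in every 70 of the world's languages.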

Generative AI and its language bias

Inevitably, this linguistic imbalance is leading to issues.

The “insane mistakes” spotted by the asylum application translators included names becoming months, crucial details missing, even immigration sentences being reversed. "The machines themselves are not operating with even a fraction of the quality they need to be able to do casework that's acceptable for someone in a high-stakes situation," Ariel Koren, founder of Respond Crisis Translation, told Reuters Context.

It’s a view shared by CDT’s Gabriel Nicholas and Aliya Bhatia, who point out that, despite the gradual emergence of Multilingual Language Models (MLMs), they “are still usually trained disproportionately on English language text and thus end up transferring values and assumptions encoded in English into other language contexts where they may not belong”. They give the example of the word “dove”, which an MLM might interpret in various languages as being associated with peace, but the Basque equivalent (“uso”) is in fact an insult.

What’s needed is the development of non-English Natural Language Processing (NLP) applications, say experts, to help reduce the language bias in generative AI and “preserve cultural heritage”. The latter is one of 30 suggested actions put forward in the World Economic Forum’s Presidio Recommendations on Responsible Generative AI. “Public and private sector should invest in creating curated datasets and developing language models for underrepresented languages, leveraging the expertise of local communities and researchers and making them available,” it says.

Addressing the AI language bias

There are signs that governments, the tech community and even individuals are taking steps to resolve the AI language issue.

The Indian government is building Bhashini, an AI translation system trained on local languages. India has 22 official languages, but few are currently captured by NLP applications. Indian tech firm Karya is also trying to redress the balance by building datasets for firms like Microsoft and Google to use in AI models. It’s a painstaking process, involving people reading words in their native language into an app.

Launched in the UAE in 2023, Jais AI is an Arabic language model capable of generating high-quality text in Arabic, including regional dialects, says Digital Watch. More recently, the UAE launched Falcon Arabic, which "aims to capture the full linguistic diversity of the Arab world", reports Reuters.

In New Zealand, local broadcaster Te Hiku Media is harnessing AI to aid the “preservation, promotion and revitalization of te reo Māori”, its chief technology officer told Nvidia, which helped create the automatic speech recognition models it says can transcribe te reo with 92% accuracy.

Te Hiku Media's multilingual LLM, called Papa Reo, is the brainchild of Peter Lucas Jones, named by Time Magazine as one of the most influential people in AI in 2024. Talking to the Forum at the Annual Meeting in 2025, Jones said, "I'm fighting to ensure that the Indigenous languages of the Pacific are not counted amongst those that will become extinct".

His team travel around New Zealand recording people speaking te reo Māori. The recordings are then added to Papa Reo to power Kaituhi, an automatic transcription tool.

"Our focus is on accuracy. Our focus is on precision. Our focus is on maintaining quality. Our focus is on reminding ourselves that one word in Māori can mean up to seven different things," he said.

In a similar endeavour, grassroots organization Masakhane is working to “strengthen and spur NLP research in African languages”. There are around 2,000 languages spoken across Africa, yet they are “barely represented in technology”, it says.

Nigeria's government is also taking action, launching its first multilingual LLM in 2024. “The LLM will be trained on five low-resource languages and accented English to ensure stronger language representation in existing datasets for the development of artificial intelligence solutions,” Dr 'Bosun Tijani, the Minister of Communications, Innovation and Digital Economy, announced on LinkedIn.

Art and AI

Preserving language and culture through technology is of paramount importance to Justin Langan. A Forum Global Shaper, originally from the rural community of Swan River in Manitoba, Canada, Langan has recorded Indigenous elders' stories and created a living repository of these films to ensure their traditional knowledge doesn't disappear.

Langan is also an advocate for bridging the gap between Indigenous Peoples and modern technology. "This knowledge, this Indigenous wisdom can be preserved by integrating a lot of the new technologies we have, within not just Canada, but around the world," he told the Forum. "I'm talking artificial intelligence and utilizing this new technology to maintain and preserve languages, cultures and teachings. Creating apps where Indigenous elders can speak their language and not risk that language being lost."

In the Brazilian Amazon, 300 languages are spoken by Indigenous People, but only a few of the major ones are recognized by LLMs.

After being unable to communicate with the Amazonian community he was living and working with, Turkish artist Refik Anadol – who co-created the Indigenous digital artwork Winds of Yawanawa – turned his frustration into action. Anadol has spearheaded the creation of an open-source AI tool “for any Indigenous People” to “preserve their language with technology”, he told the World Economic Forum at the 2023 Annual Meeting in Davos.

“How on Earth can we create an AI that doesn’t know the whole of humanity?” he asked.

With languages “disappearing” at a rate of one every fortnight, according to UNESCO, generative AI could prove to be the death knell, or the saviour, of many of them.

License and Republishing

World Economic Forum articles may be republished in accordance with the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International Public License, and in accordance with our Terms of Use.

The views expressed in this article are those of the author alone and not the World Economic Forum.