Now you are speaking my language: why minoritised LLMs matter
How to ensure AI systems in 'low-resource' languages thrive
28 November 2024
Reading time: 13 minutes
The problem with dominant systems
Imagine Tizita, an eager computer science student in Addis Ababa who is passionate about building a career in AI. She dreams of creating AI-driven solutions that address local challenges, whether it’s improving healthcare in rural communities or helping native Amharic speakers use their first language in commercial settings. But when she tries to work with mainstream AI platforms, she encounters a frustrating reality: they don’t support her language or the customs of her culture, such as the intertwining of time and religion, or the Ethiopian calendar’s division of the year into 13 months instead of 12. These platforms assume Anglo-centric standards as the default, leaving her and millions of others struggling to adapt a technology that was not built for them.
This all-too-common scenario highlights a critical flaw in the way AI systems, and especially large language models (LLMs), are developed.
Over the past few decades, English has not only become the primary language of communication in AI research; it has also been the main focus of research in computational linguistics – the interdisciplinary field that combines linguistics and computer science to build computational models of language, and that underpins natural language processing (NLP).
As roughly half of the internet’s content is in English, it is no surprise that the vast majority of data used to train AI language models is English-language data. This abundance has enabled researchers, predominantly in North America and Europe, to achieve impressive results in language tasks. Capitalising on the linguistic uniformity of the available data, companies have developed successful chatbots and machine translation apps.
Researchers have recently shifted their focus to multilingual language technologies to produce machine translation systems that also work with ‘low-resource’ languages, and NLP tools that use data from a broader range of languages. While this sounds like a step in the right direction, it is a long way from solving the problems experienced by Tizita and others.
‘Low-resource’, high stakes
A language is described as ‘low-resource’ (or ‘under-resourced’) when there is a scarcity of high-quality text data in that language, and there is a lack of sufficient documentation, technological support and educational resources. This can be due to colonisation – where dominant languages displaced or marginalised local languages and exploited the resources of those who speak them; insufficient technological infrastructure; or lack of expert computer scientists who are fluent in a specific language and who can ethically capture the missing data.
As a result, LLMs are trained on languages widely spoken in resource-rich nations, leaving minority languages underrepresented and reinforcing the imbalances already present in the datasets.
The term ‘low-resource’ (or ‘under-resourced’) in relation to language is itself debated. It is often applied to any language other than English, without reference to the unique challenges specific language groups face to improve relevant datasets and language models.
In a similar way to the problematic use of BAME (Black, Asian and minority ethnic), sometimes used to describe people who are not white British, the term ‘low-resource’ language is imperfect in that it lumps together languages that are vastly different and so require separate consideration, rather than a catch-all label.
For example, Basque is one of the few languages with no demonstrable relationship to any other language (a language isolate), and is spoken by only about 800,000 people. Kiswahili, by comparison, is spoken by 60–150 million people across several African countries. Yet Basque is far better represented online, with six times more Wikipedia articles in Basque than in Kiswahili. Basque is spoken in a relatively wealthy country, whereas Kiswahili is spoken in a region of the world whose resources have been, and still are being, exploited. If both Basque and Kiswahili are labelled ‘low-resource’ languages, as is the case in comparison to English, it should be for different reasons that reflect their respective histories.
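The disparity is straightforward to check for yourself. The minimal Python sketch below queries the public MediaWiki siteinfo API for article counts in the Basque (‘eu’) and Kiswahili (‘sw’) Wikipedia editions; the endpoint and response fields are standard, though the exact counts, and therefore the ratio, change over time.

```python
import requests

# Compare article counts across Wikipedia language editions using the
# standard MediaWiki siteinfo API ('eu' = Basque, 'sw' = Kiswahili).
def wikipedia_article_count(lang_code: str) -> int:
    url = f"https://{lang_code}.wikipedia.org/w/api.php"
    params = {
        "action": "query",
        "meta": "siteinfo",
        "siprop": "statistics",
        "format": "json",
    }
    response = requests.get(url, params=params, timeout=10)
    response.raise_for_status()
    return response.json()["query"]["statistics"]["articles"]

if __name__ == "__main__":
    basque = wikipedia_article_count("eu")
    kiswahili = wikipedia_article_count("sw")
    print(f"Basque articles:    {basque:,}")
    print(f"Kiswahili articles: {kiswahili:,}")
    print(f"Basque/Kiswahili ratio: {basque / kiswahili:.1f}x")
```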
The confusion around the status of ‘low-resource’ languages signals a deeper issue, as researchers with little understanding of these languages’ grammar, content and cultural contexts crudely scrape the internet to compensate for the lack of relevant data. They train existing models on this data to achieve marginal improvements in tasks such as machine translation from, for example, English into Kiswahili. The effect is that, instead of creating truly inclusive systems that account for nuanced cultural and linguistic differences by collaborating with native speakers, more languages are simply digitised and bolted onto the existing infrastructure.
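To make that workflow concrete, here is a minimal sketch of such a baseline, assuming the Hugging Face transformers library and the publicly released OPUS-MT English-to-Kiswahili model (Helsinki-NLP/opus-mt-en-sw), which, like most systems of its kind, was trained largely on web-crawled parallel text.

```python
# A minimal English-to-Kiswahili machine translation baseline of the
# kind described above. Assumes the Hugging Face `transformers` library
# and the publicly released OPUS-MT model; fluent-looking output from
# such models can mask the cultural and contextual errors discussed here.
from transformers import pipeline

translator = pipeline("translation", model="Helsinki-NLP/opus-mt-en-sw")

result = translator("Where can I find a doctor?")
print(result[0]["translation_text"])
```

A few lines of code produce plausible-looking output, which is precisely how such systems come to be judged adequate without native-speaker collaboration.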
This practice results in language modelling biases, with multilingual systems developed from an Anglo-centric perspective and deemed to work ‘well enough’ from that dominant viewpoint, ignoring the risk of westernised cultural homogenisation. As the computational language gap is only superficially bridged, non-Anglophone users are left with the sole option of adapting their communication to fit the status quo. Instead of attributing equal value to different systems of knowledge, mainstream AI platforms amplify pre-existing epistemic injustices.
More than lost in translation
Language affects how we communicate with and relate to others. We share how we perceive the world and ourselves in languages that change in time just as we do. Language is a vessel for culture, identity and knowledge, with single languages encompassing multiple cultures, worldviews and political perspectives. So how can we expect a one-size-fits-all computational system to express the nuances of all world languages?
When LLMs fail to support languages other than English, they do more than exclude non-English speakers. They erase cultural histories and ways of thinking. This erasure is all the more concerning in a world where AI is becoming integrated into our daily lives and interacts with essential services, from education to healthcare, and with high-stakes functions of government, like border control.
In a recent example, AI apps have failed asylum seekers, with some detained for months due to translation inaccuracies rooted in linguistic and cultural misrepresentations. What may look like simple ‘lost-in-translation’ moments are in fact iterations of cultural, racial and gender discrimination. When immigration systems rely on translation tools that are inherently limited, human lives are at risk. Accurate LLMs become a gateway to survival: they can improve systems that lack human translators and support people who have gone through traumatising experiences and who may be deported or barred from accessing medical care because of a single translation mistake.
Minoritised languages and alternative systems
As with other kinds of social structures, when a dominant system is adopted as the default by a majority community, but fails to serve a minority community in an equal way, underrepresented groups create their own systems. In the Black feminist movement, for instance, Black women have historically had to build their own frameworks of resistance and empowerment not only to address the inadequacies of traditional feminism but also to tackle the unique intersectional struggles they face.
These minoritised systems, however, tend to be affected by two major issues. First, they are often perceived as responses to the shortcomings of dominant systems, rather than as original and autonomous structures with their own legitimacy and singular perspectives. Second, they tend to populate fragmented ecosystems, in which efforts are replicated and communities do not communicate with each other or collaborate.
Some AI companies based on the African continent have been working to overcome both issues. Lelapa AI, for example, has been developing AI technologies specifically tailored to African languages and cultural contexts. Its research programme aims to build something new and valuable from the ground up rather than fixing the problems of existing systems. To provide access to AI for people whose first language is neither of the two main colonial languages in Africa – English and French – Lelapa AI has focused on specific South African languages, such as isiZulu and Sesotho.
This type of research is not just crucial for the communities it serves. It also offers a model for how minoritised systems can innovate and contribute to the broader technological ecosystem.
In South Africa, for instance, the existing languages tell the history of oppression under the apartheid regime: how people were discriminated against based on both their race and their first language. As English was the main language of scientific contexts, many indigenous communities were never given the opportunity to communicate scientifically in their own language. Thus, there is no isiZulu word for ‘dinosaur’ or ‘evolution’. As NLP research develops LLMs based on existing languages, we must avoid replicating long-standing racist dynamics. The lack of documented scientific vocabulary in certain languages should not mean those languages are excluded from datasets. In an effort to decolonise science, initiatives by organisations like Masakhane are translating scientific papers into various ‘low-resource’ languages, thereby creating new terminology and expanding the datasets available for further training of LLMs.
Kenyan writer Ngũgĩ wa Thiong’o, author of ‘Decolonising the Mind’, and Ghanaian sociology professor Kwesi Kwaa Prah have long advocated for a shift in how we think about language and technology, contributing to a new narrative that reclaims and decolonises scientific communication. At the same time, data and AI communities such as Masakhane and the Māori Data Sovereignty Network have gained attention and are now involved in NLP research in indigenous languages that directly engages indigenous researchers. The goal is to build a vocabulary and platform for people to talk about their research in their own language and thereby take cultural ownership of science.
Emerging communities with similar objectives often operate with limited material resources and capacity, in contexts where it is difficult to acquire adequate skills. For instance, they might lack a structured data ecosystem; have insufficient network infrastructure and connectivity; and be at a clear disadvantage in a highly competitive international market that does not care about their needs.
With this picture in mind, teams researching NLP in minoritised languages must grapple with fundamental questions concerning their ultimate goal and how to move towards it. Should researchers aim for a unified global system that adequately represents minoritised groups in a single language model like ChatGPT? Or should they instead strive for a world of smaller, autonomous systems designed by and for minoritised communities? Is the latter even possible, from both a technical and political perspective, in the context of power dynamics that have been favouring Western European and North American countries and have led to an increased cultural homogenisation?
Unified versus pluralist systems?
A unified system, where all languages and cultures are equally integrated into a single model (a sort of universal ChatGPT), can offer representative AI tools on a broad scale due to its global accessibility. The increased accessibility could lead to more funding for the organisations building it, as a system running on a global platform and offering the same applications to a worldwide audience would appeal to companies trying to expand their market. Increased investments, in turn, may enable better system maintenance and further development to integrate the system into other applications.
However, the success of this market-orientated dynamic relies on companies being committed to investing time and resources and to serving the interests of marginalised communities. And apparently easy alternatives, such as making models open-source to foster their use and autonomous development, are unlikely to work. Instead of ‘democratising’ AI, as some companies have framed their open-sourcing strategy, this only creates new power imbalances. When a company grants free access to its developing AI products, it enables their use for any purpose, including highly problematic ones, without having to deal with the consequences.
At the same time, the global collaboration needed among committed partners to create a truly inclusive system is difficult to achieve. This is especially the case in regions facing severe crises like wars, environmental catastrophes or political tensions.
Above all, the main hurdle seems to be the objective of the effort itself: trying to generalise something – a linguistic system – that may be impossible to generalise without compromise.
Communities like Lelapa AI or Masakhane, which focus on a selection of African languages and seek to address dataset imbalances, can still reproduce existing discriminations by serving only the more widely spoken African languages. This creates a circular problem: if no community is building a minoritised system for a specific ‘low-resource’ language, then there is no system available for those who need it.
In contrast, there are several advantages to developing minoritised systems within a pluralist infrastructure. Tailored to the needs of the linguistic group it serves, each system may provide more accurate results for the specific language it was built to encode, producing better translations and higher-quality content. At the same time, by truly respecting cultures and languages, minoritised systems enable greater community involvement, reducing the need to compromise with dominant systems. This leads to more ethical and respectful treatment of collaborators, aiming to put an end to the current exploitation and underpayment of the global workforce that makes the rapid growth of LLMs possible.
However, this pluralism also comes with challenges. Systems trained on datasets of different sizes will offer user experiences of different quality. An ecosystem made of many discrete initiatives could mean that those working on dominant systems will be less incentivised to collaborate and contribute to improvements, leaving minority communities to fend for themselves. How do we ensure that minoritised systems can thrive and avoid becoming isolated or marginalised in new ways?
Challenges and questions for minoritised AI systems
In political terms, a unified approach may be fundamentally at odds with a pluralist society. No LLM can truly be apolitical; it will always reflect the values and priorities of those who built it. A unified artificial system may not adequately represent differing or even opposing positions.
Questions such as which data is used to train a model, and who directs the training on ‘low-resource’ language datasets (often Western companies), are important.
Who gets to decide whether a system is valid or not? Is it the responsibility of the developers and the company who built it, or should countries whose languages are being used have authority? This is further complicated by the fact that systems are often built by companies based abroad. To what extent do native speakers own the rights to govern their own language, especially when their language is only a part of a bigger system? Perhaps we start by asking what kind of society we want to live in and what ecosystem of LLMs would be compatible with that.
One potential solution lies in fostering collaboration and community involvement.
The Universal Knowledge Core (UKC), with its LiveLanguage initiative, and the PanLex database are both great examples of lexical translation databases that are continually updated through collaboration. These databases are the first of their kind, representing over 2,000 language lexica and showing both what is and what isn’t shared across languages and cultures.
They include words that have similar meanings in multiple languages as well as untranslatable terms or language-specific grammar. In this way, diversity-aware language resources and an ever-growing network of international collaborators can more effectively tackle language modelling biases together. The databases are panlingual, aiming to cover as many global languages as possible. While machine translation built on these databases is still limited to a few hundred languages, the lexical datasets offer dictionaries for thousands of languages and can support the development of minoritised language models.
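To see what ‘diversity-aware’ means in practice, consider the hypothetical, much-simplified sketch below. It is not the actual schema or API of the UKC or PanLex; it only illustrates the underlying idea, that concepts map to per-language lexicalisations, and that the gaps are as informative as the matches.

```python
# A hypothetical, much-simplified illustration of a diversity-aware
# lexical database in the spirit of UKC or PanLex (not their real
# schema or API): concepts map to per-language words, and missing
# entries reveal lexical gaps between languages.
from dataclasses import dataclass, field

@dataclass
class Concept:
    gloss: str  # language-neutral description of the concept
    lexicalisations: dict[str, str] = field(default_factory=dict)  # ISO code -> word

    def gap_in(self, lang: str) -> bool:
        """True if this concept has no recorded word in `lang` (a lexical gap)."""
        return lang not in self.lexicalisations

lexicon = [
    Concept("domesticated canine", {"en": "dog", "sw": "mbwa", "eu": "txakur"}),
    # A language-specific term with no one-word English equivalent:
    Concept("communal self-help work (Kiswahili)", {"sw": "harambee"}),
]

for concept in lexicon:
    missing = [lang for lang in ("en", "sw", "eu") if concept.gap_in(lang)]
    print(f"{concept.gloss}: gaps in {missing or 'none'}")
```

Real panlingual databases hold far more structure (senses, scripts, provenance), but the same principle lets diversity-aware resources record untranslatable terms in their own right instead of forcing every language through an English pivot.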
More initiatives of this type could help create concrete solutions to the present epistemic injustice, which risks further (computational) silencing of marginalised language communities. This is not a matter of convenience: people’s rights are at risk of being restricted based on the languages they speak. One of the alternatives points towards a pluralist system that thrives through community-based and collaborative development.