The State of Multilingual LLMs: Moving Beyond English

The State of Multilingual LLMs: Moving Beyond English


According to Microsoft research, around 88% of the world’s languages, spoken by 1.2 billion people, lack access to Large Language Models (LLMs). This is because most LLMs are English-centered, i.e., they are mostly built with English data and for English speakers. ​This English dominance also prevails in LLM development and has resulted in a digital language gap, potentially excluding most people from the benefits of LLMs. To solve this problem for LLMs, an LLM that can be trained in different languages and perform tasks in different languages is needed. Enter Multilingual LLMs!

What are Multilingual LLMs?

A multilingual LLM can understand and generate text in multiple languages. They are trained on datasets that contain different languages and can take on various tasks in more than one language from a user’s prompt.

Multilingual LLM applications are enormous, they include translating literature into local dialects, real-time multilingual communication, multilingual content creation, etc. They would help everyone access information and talk to each other easily, no matter their language.

Also, multilingual LLMs address challenges such as lack of cultural nuances and context, training data limitations, and the potential loss of knowledge during translation.

How do Multilingual LLMs Work?

Building a multilingual LLM involves carefully preparing a balanced corpus of text in various languages and selecting a suitable architecture and training technique for training the model, preferably a Transformer model, which is perfect for multilingual learning.

Steps to build a multilingual LLM

Source: Image by author

One technique is to share embeddings, which capture the semantic meaning of words across different languages. This makes the LLM learn the similarities and differences of each language, enabling it to understand the different languages better.

This knowledge also empowers the LLM to adapt to various linguistic tasks, like translating languages, writing in different styles, etc. Another technique used is cross-lingual transfer learning, where the model is pre-trained on a large corpus of multilingual data before being fine-tuned on specific tasks.

This two-step process ensures the model has a strong foundation in multilingual language understanding, making it adaptable to various downstream applications.

Examples of Multilingual Large Language Models

Multilingual LLM comparison chart

Source: Ruder.io

Several notable examples of multilingual LLMs have emerged, each catering to specific linguistic needs and cultural contexts. Let’s explore a few of them:

1. BLOOM

BLOOM is an open-access multilingual LLM that prioritizes diverse languages and accessibility. With 176 billion parameters, BLOOM can handle tasks in 46 natural and 13 programming languages, making it one of the biggest and most diverse LLMs.

BLOOM’s open-source nature allows researchers, developers, and language communities to benefit from its capabilities and contribute to its improvement.

2. YAYI 2

YAYI 2 is an open-source LLM designed specifically for Asian languages, considering the region’s complexities and cultural nuances. It was pre-trained from scratch on a multilingual corpus of over 16 Asian languages containing 2.65 trillion filtered tokens.

This makes the model give better results, meeting the specific requirements of languages and cultures in Asia.

3. PolyLM

PolyLM is an open-source ‘polyglot’ LLM that focuses on addressing the challenges of low-resource languages by offering adaptation capabilities. It was trained on a dataset of about 640 billion tokens and is available in two model sizes: 1.7B and 13B. PolyLM knows over 16 different languages.

It enables models trained on high-resource languages to be fine-tuned for low-resource languages with limited data. This flexibility makes LLMs more useful in different language situations and tasks.

4. XGLM

XGLM, boasting 7.5 billion parameters, is a multilingual LLM trained on a corpus covering a diverse set of over 20 languages using the few-shot learning technique. It is part of a family of large-scale multilingual LLMs trained on a massive dataset of text and code.

It aims to cover many languages completely, which is why it focuses on inclusivity and linguistic diversity. XGLM demonstrates the potential for building models catering to the needs of various language communities.

5.  mT5

The mT5 (massively multilingual Text-to-Text Transfer Transformer) was developed by Google AI. Trained on the common crawl dataset, mt5 is a state-of-the-art multilingual LLM that can handle 101 languages, ranging from widely spoken Spanish and Chinese to less-resourced languages like Basque and Quechua.

It also excels at multilingual tasks like translation, summarization, question-answering, etc.

Is a Universal LLM Possible?

The concept of a language-neutral LLM, capable of understanding and generating language without bias towards any particular language, is intriguing.

While developing a truly universal LLM is still far away, current multilingual LLMs have demonstrated significant success. Once developed fully, they can cater to the needs of under-represented languages and diverse communities.

For instance, research shows that most multilingual LLMs can facilitate zero-shot cross-lingual transfer from a resource-rich language to a resource-deprived language without task-specific training data.

Also, models like YAYI and BLOOM, which focus on specific languages and communities, have demonstrated the potential of language-centric approaches in driving progress and inclusivity.

To build a universal LLM or improve current Multilingual LLMs, individuals and organizations must do the following:

  • Crowdsource native speakers for community engagement and curation of the language datasets.
  • Support community efforts regarding open-source contributions and funding to multilingual research and developments.

Challenges of Multilingual LLMs

While the concept of universal multilingual LLMs holds great promise, they also face several challenges that must be addressed before we can benefit from them:

1. Data Quantity

Multilingual models require a larger vocabulary to represent tokens in many languages than monolingual models, but many languages lack large-scale datasets. This makes it difficult to train these models effectively.

2. Data Quality Concerns

Ensuring the accuracy and cultural appropriateness of multilingual LLM outputs across languages is a significant concern. Models must train and fine-tune with meticulous attention to linguistic and cultural nuances to avoid biases and inaccuracies.

3. Resource Limitations

Training and running multilingual models require substantial computational resources such as powerful GPUs (e.g., NVIDIA A100 GPU). The high cost poses challenges, particularly for low-resource languages and communities with limited access to computational infrastructure.

4. Model Architecture

Adapting model architectures to accommodate diverse linguistic structures and complexities is an ongoing challenge. Models must be able to handle languages with different word orders, morphological variations, and writing systems while maintaining high performance and efficiency.

5. Evaluation Complexities

Evaluating the performance of multilingual LLMs beyond English benchmarks is critical for measuring their true effectiveness. It requires considering cultural nuances, linguistic peculiarities, and domain-specific requirements.

Multilingual LLMs have the potential to break language barriers, empower under-resourced languages, and facilitate effective communication across diverse communities.

Don’t miss out on the latest news and analysis in AI and ML – visit unite.ai today.

Deixe um comentário