
UCT Researchers Unveil AI Model for All 11 Official South African Languages

UCT researchers have launched MzansiLM, the first publicly available AI language model trained on all 11 official written languages of South Africa.

In a landmark achievement for digital inclusivity, a research team from the University of Cape Town (UCT) has developed MzansiLM, the first publicly available AI language model specifically trained to handle all 11 official written languages of South Africa.

This breakthrough, set to be presented at the Language Resources and Evaluation Conference (LREC) in Spain this month, addresses a critical “data gap” that has long left speakers of indigenous South African languages underserved by global AI systems such as ChatGPT and Claude.


The Two Pillars of the Project: MzansiText & MzansiLM

Led by Anri Lombard, Dr. Jan Buys, and Dr. Francois Meyer, the project introduces two vital contributions to the African Natural Language Processing (NLP) landscape:

  1. MzansiText: A curated, high-quality multilingual dataset. While smaller than English datasets, it represents the most comprehensive collection of South African textual data to date.
  2. MzansiLM: A “decoder-only” language model trained from scratch on this data. At 125 million parameters, it is built to serve as a baseline for future researchers.
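The article gives MzansiLM's size (125 million parameters) but not its exact architecture. To give that figure some texture, the sketch below counts the parameters of a GPT-2-small-style decoder-only transformer; the configuration values are an illustrative assumption, not MzansiLM's published specification, but they land in the same range:

```python
def decoder_param_count(vocab_size, d_model, n_layers, d_ff, max_positions):
    """Rough parameter count for a GPT-style decoder-only transformer."""
    # Token embeddings plus learned positional embeddings.
    embeddings = vocab_size * d_model + max_positions * d_model
    # Self-attention: Q, K, V and output projections, each d_model x d_model (+ bias).
    attention = 4 * (d_model * d_model + d_model)
    # Feed-forward block: two linear layers (+ biases).
    mlp = (d_model * d_ff + d_ff) + (d_ff * d_model + d_model)
    # Two layer norms per block (scale + shift vectors).
    norms = 2 * 2 * d_model
    per_layer = attention + mlp + norms
    # Final layer norm; output head assumed tied to the token embeddings.
    return embeddings + n_layers * per_layer + 2 * d_model

# GPT-2-small-like configuration -- an assumption for illustration only.
total = decoder_param_count(vocab_size=50257, d_model=768, n_layers=12,
                            d_ff=3072, max_positions=1024)
print(f"{total / 1e6:.0f}M parameters")  # roughly 124M
```

At this scale a model is cheap enough to train from scratch on a modest, curated corpus like MzansiText, which is exactly the baseline role the team describes.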

Why This Matters: Solving the “Low-Resource” Problem

In the world of AI, nine of South Africa’s 11 languages are considered “low-resource.” This means there isn’t enough digitized text (books, articles, websites) for traditional AI to “learn” the language effectively.

While languages like isiZulu and isiXhosa have seen some global interest, others like isiNdebele and Sepedi are often ignored. MzansiLM changes the narrative by including all 11, ensuring no language is left behind in the AI revolution.

“MzansiLM is believed to be the first publicly available decoder-only language model to explicitly target all 11 languages.” — Dr. Francois Meyer, UCT.


Performance: Small but Mighty

Despite being much smaller than commercial models, MzansiLM punches above its weight. In tests, it outperformed much larger open-source models on specific benchmarks. For example, its isiXhosa text generation competed with models ten times its size.
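A “decoder-only” model like this generates text one token at a time, with each step conditioned on everything produced so far. The toy loop below illustrates that autoregressive pattern using a hard-coded bigram scoring table in place of a real neural network; the tiny isiXhosa-greeting vocabulary is purely illustrative and has nothing to do with MzansiLM's actual tokenizer or weights:

```python
# Toy vocabulary and next-token scores -- illustrative only.
VOCAB = ["molo", "unjani", "?", "<eos>"]
BIGRAM_SCORES = {
    "molo":   [0.0, 2.0, 0.0, 0.0],  # after "molo", prefer "unjani"
    "unjani": [0.0, 0.0, 2.0, 0.0],  # after "unjani", prefer "?"
    "?":      [0.0, 0.0, 0.0, 2.0],  # after "?", prefer end-of-sequence
    "<eos>":  [0.0, 0.0, 0.0, 2.0],
}

def next_token_scores(tokens):
    """Stand-in for a decoder-only model's next-token distribution."""
    return BIGRAM_SCORES[tokens[-1]]

def generate(prompt, max_new_tokens=10):
    """Greedy autoregressive decoding: repeatedly append the best next token."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        scores = next_token_scores(tokens)
        best = VOCAB[max(range(len(scores)), key=scores.__getitem__)]
        tokens.append(best)
        if best == "<eos>":
            break
    return tokens

print(generate(["molo"]))  # ['molo', 'unjani', '?', '<eos>']
```

A real decoder-only model replaces the bigram table with a trained transformer, but the generation loop is structurally the same, which is why a small, well-trained baseline can still produce competitive text in a language its larger rivals barely saw during training.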

What MzansiLM is:

  • A foundation model for developers to build specific tools.
  • A tool for summarizing information or annotating data.
  • An affordable alternative to proprietary models for local use cases.

What it is NOT:

  • A general-purpose chatbot (it’s not a direct competitor to ChatGPT yet).
  • A tool for complex instruction-following (due to limited training data).

An Open Future for African AI

The UCT team has stayed true to the spirit of the African NLP community by making both the model and the dataset publicly available, with an accompanying paper on arXiv. By sharing their code and findings, they are inviting the global research community to reproduce, refine, and expand on this South African foundation.
