Text as Corpus Repository for Multilingual Machine Translation of Low-resource Languages

Christian Schuler, Deepesha Saurty, Tramy Thi Tran

Almost half of the approximately 7,000 currently spoken languages are expected to become extinct this century. It is estimated that fewer than 5% of the world's languages will be used online or have a significant digital presence. The lack of resources, including language data and translation systems, hinders effective communication and understanding across many languages. This poses a considerable obstacle to inclusivity and cultural exchange.

The aim of our project is to collect and curate language text data to support natural language processing, especially the development of robust translation systems for low-resource languages. Socially, this project aims to empower marginalized language communities and bridge communication gaps, promoting linguistic preservation and cultural diversity. Scientifically, it contributes to the field of language technology and translation systems for low-resource languages, filling a critical research gap.

Mauritian Creole (Morisien) is spoken on Mauritius, an island nation southeast of the African continent. Only recently did the Mauritian Creole Academy promote a standardized spelling (Lortograf Kreol Morisien), which, although supported by the Mauritian government, has not yet been adopted by the general population. Because large parts of the population still write as they see fit, alternative spellings exist for many words. With approximately 1.3 million speakers, Morisien is a relatively small language community. Developing or even evaluating machine translation for a language is impossible without publicly available datasets, which are still lacking for Morisien.

Kobani, a subdialect of Northern Kurdish (Kurmanji), is spoken in the north of Syria. As computer-based natural language processing for the Kurdish language is still in its early days, only a few applications exist today, and even fewer are free and openly available. Scientific work on the Kurdish language also tends to focus on a few dialects, and sometimes on a single dialect, most often Central Kurdish, also called Sorani. Regarding Kurmanji, one of the major dialects of the Kurdish language, with even more native speakers than Sorani, Öpengin and Haig (2014, p. 144) write: “Like any other natural language, Kurmanji encompasses a considerable spectrum of regional variation. Yet within academia, regional variation in Kurmanji has been almost entirely neglected.”

Vietnamese is spoken in Vietnam, in Southeast Asia. It has various dialects and a vocabulary influenced by Chinese and French. While Vietnamese has many more native speakers and a stronger digital presence than our other two target languages, it is still a low-resource language for which applications such as Google Translate struggle to offer satisfactory translations.

We deem it important to include the language communities and native speakers in our project: first to align our scientific goals with the needs of the people concerned, and later to guarantee high data quality. Collecting more low-quality data would not be prudent; only data of the highest quality offers a chance of counterbalancing the severe data scarcity our target languages face today.

References

Öpengin, E. & Haig, G. (2014). Regional variation in Kurmanji: A preliminary classification of dialects. Kurdish Studies, 2(2), 143-176.

Student project: Text as Corpus Repository for Multilingual Machine Translation of Low-resource Languages

Funding period: 01.10.2023 - 31.03.2024 (6 months)

Students: Christian Schuler, Deepesha Saurty, Tramy Thi Tran

Mentor: Dr. Seid Muhie Yimam