SEACrowd: A Multilingual Multimodal Data Hub and Benchmark Suite for Southeast Asian Languages

Bibliographic Details
Published in: arXiv.org 2024-10
Main authors: Holy Lovenia, Rahmad Mahendra, Salsabil Maulana Akbar, Miranda, Lester James V, Santoso, Jennifer, Aco, Elyanah, Akhdan Fadhilah, Mansurov, Jonibek, Imperial, Joseph Marvin, Kampman, Onno P, Joel Ruben Antony Moniz, Muhammad Ravi Shulthan Habibi, Hudi, Frederikus, Railey Montalan, Ryan, Ignatius, Joanito Agili Lopo, Nixon, William, Karlsson, Börje F, Jaya, James, Diandaru, Ryandito, Gao, Yuze, Amadeus, Patrick, Wang, Bin, Blaise Cruz, Jan Christian, Whitehouse, Chenxi, Ivan Halim Parmonangan, Khelli, Maria, Zhang, Wenyu, Susanto, Lucky, Reynard Adha Ryanda, Hermawan, Sonny Lazuardi, Velasco, Dan John, Muhammad Dehan Al Kautsar, Hendria, Willy Fitra, Moslem, Yasmin, Flynn, Noah, Muhammad Farid Adilazuarda, Li, Haochen, Lee, Johanes, Damanhuri, R, Sun, Shuo, Qorib, Muhammad Reza, Djanibekov, Amirbek, Wei Qi Leong, Do, Quyet V, Muennighoff, Niklas, Pansuwan, Tanrada, Putra, Ilham Firdausi, Xu, Yan, Ngee Chia Tai, Purwarianti, Ayu, Ruder, Sebastian, Tjhi, William, Limkonchotiwat, Peerat, Aji, Alham Fikri, Keh, Sedrick, Genta Indra Winata, Zhang, Ruochen, Koto, Fajri, Zheng-Xin, Yong, Cahyawijaya, Samuel
Format: Article
Language: English
Online access: Full text
Description
Summary: Southeast Asia (SEA) is a region rich in linguistic diversity and cultural variety, with over 1,300 indigenous languages and a population of 671 million people. However, prevailing AI models suffer from a significant lack of representation of texts, images, and audio datasets from SEA, compromising the quality of AI models for SEA languages. Evaluating models for SEA languages is challenging due to the scarcity of high-quality datasets, compounded by the dominance of English training data, raising concerns about potential cultural misrepresentation. To address these challenges, we introduce SEACrowd, a collaborative initiative that consolidates a comprehensive resource hub that fills the resource gap by providing standardized corpora in nearly 1,000 SEA languages across three modalities. Through our SEACrowd benchmarks, we assess the quality of AI models on 36 indigenous languages across 13 tasks, offering valuable insights into the current AI landscape in SEA. Furthermore, we propose strategies to facilitate greater AI advancements, maximizing potential utility and resource equity for the future of AI in SEA.
ISSN: 2331-8422