SEACrowd: A Multilingual Multimodal Data Hub and Benchmark Suite for Southeast Asian Languages

Bibliographic details
Published in: arXiv.org 2024-10
Main authors: Holy Lovenia, Rahmad Mahendra, Salsabil Maulana Akbar, Miranda, Lester James V, Santoso, Jennifer, Aco, Elyanah, Akhdan Fadhilah, Mansurov, Jonibek, Imperial, Joseph Marvin, Kampman, Onno P, Joel Ruben Antony Moniz, Muhammad Ravi Shulthan Habibi, Hudi, Frederikus, Railey Montalan, Ryan, Ignatius, Joanito Agili Lopo, Nixon, William, Karlsson, Börje F, Jaya, James, Diandaru, Ryandito, Gao, Yuze, Amadeus, Patrick, Wang, Bin, Blaise Cruz, Jan Christian, Whitehouse, Chenxi, Ivan Halim Parmonangan, Khelli, Maria, Zhang, Wenyu, Susanto, Lucky, Reynard Adha Ryanda, Hermawan, Sonny Lazuardi, Velasco, Dan John, Muhammad Dehan Al Kautsar, Hendria, Willy Fitra, Moslem, Yasmin, Flynn, Noah, Muhammad Farid Adilazuarda, Li, Haochen, Lee, Johanes, Damanhuri, R, Sun, Shuo, Qorib, Muhammad Reza, Djanibekov, Amirbek, Wei Qi Leong, Do, Quyet V, Muennighoff, Niklas, Pansuwan, Tanrada, Putra, Ilham Firdausi, Xu, Yan, Ngee Chia Tai, Purwarianti, Ayu, Ruder, Sebastian, Tjhi, William, Limkonchotiwat, Peerat, Aji, Alham Fikri, Keh, Sedrick, Genta Indra Winata, Zhang, Ruochen, Koto, Fajri, Zheng-Xin, Yong, Cahyawijaya, Samuel
Format: Article
Language: eng
Subjects: Benchmarks, Datasets, Languages, Native languages
Online access: Full text
container_end_page
container_issue
container_start_page
container_title arXiv.org
container_volume
creator Holy Lovenia
Rahmad Mahendra
Salsabil Maulana Akbar
Miranda, Lester James V
Santoso, Jennifer
Aco, Elyanah
Akhdan Fadhilah
Mansurov, Jonibek
Imperial, Joseph Marvin
Kampman, Onno P
Joel Ruben Antony Moniz
Muhammad Ravi Shulthan Habibi
Hudi, Frederikus
Railey Montalan
Ryan, Ignatius
Joanito Agili Lopo
Nixon, William
Karlsson, Börje F
Jaya, James
Diandaru, Ryandito
Gao, Yuze
Amadeus, Patrick
Wang, Bin
Blaise Cruz, Jan Christian
Whitehouse, Chenxi
Ivan Halim Parmonangan
Khelli, Maria
Zhang, Wenyu
Susanto, Lucky
Reynard Adha Ryanda
Hermawan, Sonny Lazuardi
Velasco, Dan John
Muhammad Dehan Al Kautsar
Hendria, Willy Fitra
Moslem, Yasmin
Flynn, Noah
Muhammad Farid Adilazuarda
Li, Haochen
Lee, Johanes
Damanhuri, R
Sun, Shuo
Qorib, Muhammad Reza
Djanibekov, Amirbek
Wei Qi Leong
Do, Quyet V
Muennighoff, Niklas
Pansuwan, Tanrada
Putra, Ilham Firdausi
Xu, Yan
Ngee Chia Tai
Purwarianti, Ayu
Ruder, Sebastian
Tjhi, William
Limkonchotiwat, Peerat
Aji, Alham Fikri
Keh, Sedrick
Genta Indra Winata
Zhang, Ruochen
Koto, Fajri
Zheng-Xin, Yong
Cahyawijaya, Samuel
description Southeast Asia (SEA) is a region rich in linguistic diversity and cultural variety, with over 1,300 indigenous languages and a population of 671 million people. However, prevailing AI models suffer from a significant lack of representation of texts, images, and audio datasets from SEA, compromising the quality of AI models for SEA languages. Evaluating models for SEA languages is challenging due to the scarcity of high-quality datasets, compounded by the dominance of English training data, raising concerns about potential cultural misrepresentation. To address these challenges, we introduce SEACrowd, a collaborative initiative that consolidates a comprehensive resource hub that fills the resource gap by providing standardized corpora in nearly 1,000 SEA languages across three modalities. Through our SEACrowd benchmarks, we assess the quality of AI models on 36 indigenous languages across 13 tasks, offering valuable insights into the current AI landscape in SEA. Furthermore, we propose strategies to facilitate greater AI advancements, maximizing potential utility and resource equity for the future of AI in SEA.
format Article
date 2024-10-08
oa free_for_read
publisher Ithaca: Cornell University Library, arXiv.org
rights 2024. This work is published under http://creativecommons.org/licenses/by-sa/4.0/ (the “License”). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.
fulltext fulltext
identifier EISSN: 2331-8422
ispartof arXiv.org, 2024-10
issn 2331-8422
language eng
recordid cdi_proquest_journals_3068911031
source Free E- Journals
subjects Benchmarks
Datasets
Languages
Native languages
title SEACrowd: A Multilingual Multimodal Data Hub and Benchmark Suite for Southeast Asian Languages
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-02-20T12%3A00%3A49IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest&rft_val_fmt=info:ofi/fmt:kev:mtx:book&rft.genre=document&rft.atitle=SEACrowd:%20A%20Multilingual%20Multimodal%20Data%20Hub%20and%20Benchmark%20Suite%20for%20Southeast%20Asian%20Languages&rft.jtitle=arXiv.org&rft.au=Holy%20Lovenia&rft.date=2024-10-08&rft.eissn=2331-8422&rft_id=info:doi/&rft_dat=%3Cproquest%3E3068911031%3C/proquest%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=3068911031&rft_id=info:pmid/&rfr_iscdi=true