SEACrowd: A Multilingual Multimodal Data Hub and Benchmark Suite for Southeast Asian Languages
Southeast Asia (SEA) is a region rich in linguistic diversity and cultural variety, with over 1,300 indigenous languages and a population of 671 million people. However, prevailing AI models suffer from a significant lack of representation of texts, images, and audio datasets from SEA, compromising...
Published in: | arXiv.org 2024-10 |
---|---|
Format: | Article |
Language: | eng |
Subjects: | Benchmarks; Datasets; Languages; Native languages |
Online access: | Full text |
container_end_page | |
---|---|
container_issue | |
container_start_page | |
container_title | arXiv.org |
container_volume | |
creator | Holy Lovenia ; Rahmad Mahendra ; Salsabil Maulana Akbar ; Miranda, Lester James V ; Santoso, Jennifer ; Aco, Elyanah ; Akhdan Fadhilah ; Mansurov, Jonibek ; Imperial, Joseph Marvin ; Kampman, Onno P ; Joel Ruben Antony Moniz ; Muhammad Ravi Shulthan Habibi ; Hudi, Frederikus ; Railey Montalan ; Ryan, Ignatius ; Joanito Agili Lopo ; Nixon, William ; Karlsson, Börje F ; Jaya, James ; Diandaru, Ryandito ; Gao, Yuze ; Amadeus, Patrick ; Wang, Bin ; Blaise Cruz, Jan Christian ; Whitehouse, Chenxi ; Ivan Halim Parmonangan ; Khelli, Maria ; Zhang, Wenyu ; Susanto, Lucky ; Reynard Adha Ryanda ; Hermawan, Sonny Lazuardi ; Velasco, Dan John ; Muhammad Dehan Al Kautsar ; Hendria, Willy Fitra ; Moslem, Yasmin ; Flynn, Noah ; Muhammad Farid Adilazuarda ; Li, Haochen ; Lee, Johanes ; Damanhuri, R ; Sun, Shuo ; Qorib, Muhammad Reza ; Djanibekov, Amirbek ; Wei Qi Leong ; Do, Quyet V ; Muennighoff, Niklas ; Pansuwan, Tanrada ; Putra, Ilham Firdausi ; Xu, Yan ; Ngee Chia Tai ; Purwarianti, Ayu ; Ruder, Sebastian ; Tjhi, William ; Limkonchotiwat, Peerat ; Aji, Alham Fikri ; Keh, Sedrick ; Genta Indra Winata ; Zhang, Ruochen ; Koto, Fajri ; Zheng-Xin, Yong ; Cahyawijaya, Samuel |
description | Southeast Asia (SEA) is a region rich in linguistic diversity and cultural variety, with over 1,300 indigenous languages and a population of 671 million people. However, prevailing AI models suffer from a significant lack of representation of texts, images, and audio datasets from SEA, compromising the quality of AI models for SEA languages. Evaluating models for SEA languages is challenging due to the scarcity of high-quality datasets, compounded by the dominance of English training data, raising concerns about potential cultural misrepresentation. To address these challenges, we introduce SEACrowd, a collaborative initiative that consolidates a comprehensive resource hub that fills the resource gap by providing standardized corpora in nearly 1,000 SEA languages across three modalities. Through our SEACrowd benchmarks, we assess the quality of AI models on 36 indigenous languages across 13 tasks, offering valuable insights into the current AI landscape in SEA. Furthermore, we propose strategies to facilitate greater AI advancements, maximizing potential utility and resource equity for the future of AI in SEA. |
format | Article |
publisher | Ithaca: Cornell University Library, arXiv.org |
date | 2024-10-08 |
rights | 2024. This work is published under http://creativecommons.org/licenses/by-sa/4.0/ (the “License”). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License. |
oa | free_for_read |
fulltext | fulltext |
identifier | EISSN: 2331-8422 |
ispartof | arXiv.org, 2024-10 |
issn | 2331-8422 |
language | eng |
recordid | cdi_proquest_journals_3068911031 |
source | Free E- Journals |
subjects | Benchmarks ; Datasets ; Languages ; Native languages |
title | SEACrowd: A Multilingual Multimodal Data Hub and Benchmark Suite for Southeast Asian Languages |
url | https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-02-20T12%3A00%3A49IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest&rft_val_fmt=info:ofi/fmt:kev:mtx:book&rft.genre=document&rft.atitle=SEACrowd:%20A%20Multilingual%20Multimodal%20Data%20Hub%20and%20Benchmark%20Suite%20for%20Southeast%20Asian%20Languages&rft.jtitle=arXiv.org&rft.au=Holy%20Lovenia&rft.date=2024-10-08&rft.eissn=2331-8422&rft_id=info:doi/&rft_dat=%3Cproquest%3E3068911031%3C/proquest%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=3068911031&rft_id=info:pmid/&rfr_iscdi=true |