CVQA: Culturally-diverse Multilingual Visual Question Answering Benchmark

Detailed Description

Visual Question Answering (VQA) is an important task in multimodal AI, and it is often used to test the ability of vision-language models to understand and reason over knowledge present in both visual and textual data. However, most current VQA models use datasets that focus primarily on English and a few major world languages, with images that are typically Western-centric. While recent efforts have tried to increase the number of languages covered in VQA datasets, they still lack diversity in low-resource languages. More importantly, although these datasets often extend their linguistic range via translation or other approaches, they usually keep the images the same, resulting in narrow cultural representation. To address these limitations, we construct CVQA, a new Culturally-diverse multilingual Visual Question Answering benchmark designed to cover a rich set of languages and cultures, engaging native speakers and cultural experts in the data collection process. As a result, CVQA includes culturally driven images and questions from 30 countries on four continents, covering 31 languages with 13 scripts and providing a total of 10k questions. We then benchmark several Multimodal Large Language Models (MLLMs) on CVQA and show that the dataset is challenging for current state-of-the-art models. This benchmark can serve as a probing evaluation suite for assessing the cultural capabilities and biases of multimodal models, and we hope it encourages more research toward increasing cultural awareness and linguistic diversity in this field.

Bibliographic Details
Main Authors: Romero, David; Lyu, Chenyang; Wibowo, Haryo Akbarianto; Lynn, Teresa; Hamed, Injy; Kishore, Aditya Nanda; Mandal, Aishik; Dragonetti, Alina; Abzaliev, Artem; Tonja, Atnafu Lambebo; Balcha, Bontu Fufa; Whitehouse, Chenxi; Salamea, Christian; Velasco, Dan John; Adelani, David Ifeoluwa; Meur, David Le; Villa-Cueva, Emilio; Koto, Fajri; Farooqui, Fauzan; Belcavello, Frederico; Batnasan, Ganzorig; Vallejo, Gisela; Caulfield, Grainne; Ivetta, Guido; Song, Haiyue; Ademtew, Henok Biadglign; Maina, Hernán; Lovenia, Holy; Azime, Israel Abebe; Cruz, Jan Christian Blaise; Gala, Jay; Geng, Jiahui; Ortiz-Barajas, Jesus-German; Baek, Jinheon; Dunstan, Jocelyn; Alemany, Laura Alonso; Nagasinghe, Kumaranage Ravindu Yasas; Benotti, Luciana; D'Haro, Luis Fernando; Viridiano, Marcelo; Estecha-Garitagoitia, Marcos; Cabrera, Maria Camila Buitrago; Rodríguez-Cantelar, Mario; Jouitteau, Mélanie; Mihaylov, Mihail; Imam, Mohamed Fazli Mohamed; Adilazuarda, Muhammad Farid; Gochoo, Munkhjargal; Otgonbold, Munkh-Erdene; Etori, Naome; Niyomugisha, Olivier; Silva, Paula Mónica; Chitale, Pranjal; Dabre, Raj; Chevi, Rendi; Zhang, Ruochen; Diandaru, Ryandito; Cahyawijaya, Samuel; Góngora, Santiago; Jeong, Soyeong; Purkayastha, Sukannya; Kuribayashi, Tatsuki; Clifford, Teresa; Jayakumar, Thanmay; Torrent, Tiago Timponi; Ehsan, Toqeer; Araujo, Vladimir; Kementchedjhieva, Yova; Burzo, Zara; Lim, Zheng Wei; Yong, Zheng Xin; Ignat, Oana; Nwatu, Joan; Mihalcea, Rada; Solorio, Thamar; Aji, Alham Fikri
Format: Article
Language: English
Subjects: Computer Science - Artificial Intelligence; Computer Science - Computation and Language; Computer Science - Computer Vision and Pattern Recognition; Computer Science - Learning
Published: 2024-06-09
DOI: 10.48550/arxiv.2406.05967
Source: arXiv.org
Rights: http://creativecommons.org/licenses/by-sa/4.0
Online Access: https://arxiv.org/abs/2406.05967