AfriWOZ: Corpus for Exploiting Cross-Lingual Transferability for Generation of Dialogues in Low-Resource, African Languages

Dialogue generation is an important NLP task fraught with many challenges. The challenges become more daunting for low-resource African languages. To enable the creation of dialogue agents for African languages, we contribute the first high-quality dialogue datasets for 6 African languages: Swahili,...

Detailed description

Bibliographic details
Published in: arXiv.org 2022-05
Main authors: Adewumi, Tosin, Adeyemi, Mofetoluwa, Aremu Anuoluwapo, Peters, Bukola, Buzaaba, Happy, Oyerinde Samuel, Rufai, Amina Mardiyyah, Ajibade, Benjamin, Gwadabe, Tajudeen, Mory Moussou Koulibaly Traore, Ajayi, Tunde, Shamsuddeen Muhammad, Baruwa, Ahmed, Owoicho, Paul, Ogunremi, Tolulope, Ngigi, Phylis, Ahia, Orevaoghene, Nasir, Ruqayya, Liwicki, Foteini, Liwicki, Marcus
Format: Article
Language: eng
Subjects:
Online access: Full text
container_end_page
container_issue
container_start_page
container_title arXiv.org
container_volume
creator Adewumi, Tosin
Adeyemi, Mofetoluwa
Aremu Anuoluwapo
Peters, Bukola
Buzaaba, Happy
Oyerinde Samuel
Rufai, Amina Mardiyyah
Ajibade, Benjamin
Gwadabe, Tajudeen
Mory Moussou Koulibaly Traore
Ajayi, Tunde
Shamsuddeen Muhammad
Baruwa, Ahmed
Owoicho, Paul
Ogunremi, Tolulope
Ngigi, Phylis
Ahia, Orevaoghene
Nasir, Ruqayya
Liwicki, Foteini
Liwicki, Marcus
description Dialogue generation is an important NLP task fraught with many challenges. The challenges become more daunting for low-resource African languages. To enable the creation of dialogue agents for African languages, we contribute the first high-quality dialogue datasets for 6 African languages: Swahili, Wolof, Hausa, Nigerian Pidgin English, Kinyarwanda & Yorùbá. These datasets consist of 1,500 turns each, which we translate from a portion of the English multi-domain MultiWOZ dataset. Subsequently, we investigate & analyze the effectiveness of modelling through transfer learning by utilizing state-of-the-art (SoTA) deep monolingual models: DialoGPT and BlenderBot. We compare the models with a simple seq2seq baseline using perplexity. Besides this, we conduct human evaluation of single-turn conversations by using majority votes and measure inter-annotator agreement (IAA). We find that the hypothesis that deep monolingual models learn some abstractions that generalize across languages holds. We observe human-like conversations, to different degrees, in 5 out of the 6 languages. The language with the most transferable properties is the Nigerian Pidgin English, with a human-likeness score of 78.1%, of which 34.4% are unanimous. We freely provide the datasets and host the model checkpoints/demos on the HuggingFace hub for public access.
format Article
fullrecord <record><control><sourceid>proquest</sourceid><recordid>TN_cdi_proquest_journals_2652412446</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><sourcerecordid>2652412446</sourcerecordid><originalsourceid>FETCH-proquest_journals_26524124463</originalsourceid><addsrcrecordid>eNqNTckKwjAQDYKgqP8w4NVCTRfFm9TtIAhSELxILElJCZmaaVHx543iB3h6j7d2WJ9H0TSYx5z32IioCsOQpzOeJFGfvZbK6dPhvIAMXd0SKHSwftQGdaNtCZlDomDvaSsM5E5YUtKJqza6eX7DW2m90Gi0gApWWhgsW0mgLezxHhwlYesKOYHPUyG8Kj5jpaQh6yphSI5-OGDjzTrPdkHt8OYnmkvlq9ZbF54mPJ7yOE6j_1JvM9BOeg</addsrcrecordid><sourcetype>Aggregation Database</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>2652412446</pqid></control><display><type>article</type><title>AfriWOZ: Corpus for Exploiting Cross-Lingual Transferability for Generation of Dialogues in Low-Resource, African Languages</title><source>Free E- Journals</source><creator>Adewumi, Tosin ; Adeyemi, Mofetoluwa ; Aremu Anuoluwapo ; Peters, Bukola ; Buzaaba, Happy ; Oyerinde Samuel ; Rufai, Amina Mardiyyah ; Ajibade, Benjamin ; Gwadabe, Tajudeen ; Mory Moussou Koulibaly Traore ; Ajayi, Tunde ; Shamsuddeen Muhammad ; Baruwa, Ahmed ; Owoicho, Paul ; Ogunremi, Tolulope ; Ngigi, Phylis ; Ahia, Orevaoghene ; Nasir, Ruqayya ; Liwicki, Foteini ; Liwicki, Marcus</creator><creatorcontrib>Adewumi, Tosin ; Adeyemi, Mofetoluwa ; Aremu Anuoluwapo ; Peters, Bukola ; Buzaaba, Happy ; Oyerinde Samuel ; Rufai, Amina Mardiyyah ; Ajibade, Benjamin ; Gwadabe, Tajudeen ; Mory Moussou Koulibaly Traore ; Ajayi, Tunde ; Shamsuddeen Muhammad ; Baruwa, Ahmed ; Owoicho, Paul ; Ogunremi, Tolulope ; Ngigi, Phylis ; Ahia, Orevaoghene ; Nasir, Ruqayya ; Liwicki, Foteini ; Liwicki, Marcus</creatorcontrib><description>Dialogue generation is an important NLP task fraught with many challenges. The challenges become more daunting for low-resource African languages. 
To enable the creation of dialogue agents for African languages, we contribute the first high-quality dialogue datasets for 6 African languages: Swahili, Wolof, Hausa, Nigerian Pidgin English, Kinyarwanda &amp; Yorùbá. These datasets consist of 1,500 turns each, which we translate from a portion of the English multi-domain MultiWOZ dataset. Subsequently, we investigate &amp; analyze the effectiveness of modelling through transfer learning by utilizing state-of-the-art (SoTA) deep monolingual models: DialoGPT and BlenderBot. We compare the models with a simple seq2seq baseline using perplexity. Besides this, we conduct human evaluation of single-turn conversations by using majority votes and measure inter-annotator agreement (IAA). We find that the hypothesis that deep monolingual models learn some abstractions that generalize across languages holds. We observe human-like conversations, to different degrees, in 5 out of the 6 languages. The language with the most transferable properties is the Nigerian Pidgin English, with a human-likeness score of 78.1%, of which 34.4% are unanimous. We freely provide the datasets and host the model checkpoints/demos on the HuggingFace hub for public access.</description><identifier>EISSN: 2331-8422</identifier><language>eng</language><publisher>Ithaca: Cornell University Library, arXiv.org</publisher><subject>African languages ; Datasets ; English language ; Evaluation ; Hypotheses ; Languages ; Speech recognition</subject><ispartof>arXiv.org, 2022-05</ispartof><rights>2022. This work is published under http://creativecommons.org/licenses/by/4.0/ (the “License”). 
Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.</rights><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><link.rule.ids>776,780</link.rule.ids></links><search><creatorcontrib>Adewumi, Tosin</creatorcontrib><creatorcontrib>Adeyemi, Mofetoluwa</creatorcontrib><creatorcontrib>Aremu Anuoluwapo</creatorcontrib><creatorcontrib>Peters, Bukola</creatorcontrib><creatorcontrib>Buzaaba, Happy</creatorcontrib><creatorcontrib>Oyerinde Samuel</creatorcontrib><creatorcontrib>Rufai, Amina Mardiyyah</creatorcontrib><creatorcontrib>Ajibade, Benjamin</creatorcontrib><creatorcontrib>Gwadabe, Tajudeen</creatorcontrib><creatorcontrib>Mory Moussou Koulibaly Traore</creatorcontrib><creatorcontrib>Ajayi, Tunde</creatorcontrib><creatorcontrib>Shamsuddeen Muhammad</creatorcontrib><creatorcontrib>Baruwa, Ahmed</creatorcontrib><creatorcontrib>Owoicho, Paul</creatorcontrib><creatorcontrib>Ogunremi, Tolulope</creatorcontrib><creatorcontrib>Ngigi, Phylis</creatorcontrib><creatorcontrib>Ahia, Orevaoghene</creatorcontrib><creatorcontrib>Nasir, Ruqayya</creatorcontrib><creatorcontrib>Liwicki, Foteini</creatorcontrib><creatorcontrib>Liwicki, Marcus</creatorcontrib><title>AfriWOZ: Corpus for Exploiting Cross-Lingual Transferability for Generation of Dialogues in Low-Resource, African Languages</title><title>arXiv.org</title><description>Dialogue generation is an important NLP task fraught with many challenges. The challenges become more daunting for low-resource African languages. To enable the creation of dialogue agents for African languages, we contribute the first high-quality dialogue datasets for 6 African languages: Swahili, Wolof, Hausa, Nigerian Pidgin English, Kinyarwanda &amp; Yorùbá. 
These datasets consist of 1,500 turns each, which we translate from a portion of the English multi-domain MultiWOZ dataset. Subsequently, we investigate &amp; analyze the effectiveness of modelling through transfer learning by utilizing state-of-the-art (SoTA) deep monolingual models: DialoGPT and BlenderBot. We compare the models with a simple seq2seq baseline using perplexity. Besides this, we conduct human evaluation of single-turn conversations by using majority votes and measure inter-annotator agreement (IAA). We find that the hypothesis that deep monolingual models learn some abstractions that generalize across languages holds. We observe human-like conversations, to different degrees, in 5 out of the 6 languages. The language with the most transferable properties is the Nigerian Pidgin English, with a human-likeness score of 78.1%, of which 34.4% are unanimous. We freely provide the datasets and host the model checkpoints/demos on the HuggingFace hub for public access.</description><subject>African languages</subject><subject>Datasets</subject><subject>English language</subject><subject>Evaluation</subject><subject>Hypotheses</subject><subject>Languages</subject><subject>Speech recognition</subject><issn>2331-8422</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2022</creationdate><recordtype>article</recordtype><sourceid>BENPR</sourceid><recordid>eNqNTckKwjAQDYKgqP8w4NVCTRfFm9TtIAhSELxILElJCZmaaVHx543iB3h6j7d2WJ9H0TSYx5z32IioCsOQpzOeJFGfvZbK6dPhvIAMXd0SKHSwftQGdaNtCZlDomDvaSsM5E5YUtKJqza6eX7DW2m90Gi0gApWWhgsW0mgLezxHhwlYesKOYHPUyG8Kj5jpaQh6yphSI5-OGDjzTrPdkHt8OYnmkvlq9ZbF54mPJ7yOE6j_1JvM9BOeg</recordid><startdate>20220519</startdate><enddate>20220519</enddate><creator>Adewumi, Tosin</creator><creator>Adeyemi, Mofetoluwa</creator><creator>Aremu Anuoluwapo</creator><creator>Peters, Bukola</creator><creator>Buzaaba, Happy</creator><creator>Oyerinde Samuel</creator><creator>Rufai, Amina Mardiyyah</creator><creator>Ajibade, 
Benjamin</creator><creator>Gwadabe, Tajudeen</creator><creator>Mory Moussou Koulibaly Traore</creator><creator>Ajayi, Tunde</creator><creator>Shamsuddeen Muhammad</creator><creator>Baruwa, Ahmed</creator><creator>Owoicho, Paul</creator><creator>Ogunremi, Tolulope</creator><creator>Ngigi, Phylis</creator><creator>Ahia, Orevaoghene</creator><creator>Nasir, Ruqayya</creator><creator>Liwicki, Foteini</creator><creator>Liwicki, Marcus</creator><general>Cornell University Library, arXiv.org</general><scope>8FE</scope><scope>8FG</scope><scope>ABJCF</scope><scope>ABUWG</scope><scope>AFKRA</scope><scope>AZQEC</scope><scope>BENPR</scope><scope>BGLVJ</scope><scope>CCPQU</scope><scope>DWQXO</scope><scope>HCIFZ</scope><scope>L6V</scope><scope>M7S</scope><scope>PIMPY</scope><scope>PQEST</scope><scope>PQQKQ</scope><scope>PQUKI</scope><scope>PRINS</scope><scope>PTHSS</scope></search><sort><creationdate>20220519</creationdate><title>AfriWOZ: Corpus for Exploiting Cross-Lingual Transferability for Generation of Dialogues in Low-Resource, African Languages</title><author>Adewumi, Tosin ; Adeyemi, Mofetoluwa ; Aremu Anuoluwapo ; Peters, Bukola ; Buzaaba, Happy ; Oyerinde Samuel ; Rufai, Amina Mardiyyah ; Ajibade, Benjamin ; Gwadabe, Tajudeen ; Mory Moussou Koulibaly Traore ; Ajayi, Tunde ; Shamsuddeen Muhammad ; Baruwa, Ahmed ; Owoicho, Paul ; Ogunremi, Tolulope ; Ngigi, Phylis ; Ahia, Orevaoghene ; Nasir, Ruqayya ; Liwicki, Foteini ; Liwicki, Marcus</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-proquest_journals_26524124463</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2022</creationdate><topic>African languages</topic><topic>Datasets</topic><topic>English language</topic><topic>Evaluation</topic><topic>Hypotheses</topic><topic>Languages</topic><topic>Speech recognition</topic><toplevel>online_resources</toplevel><creatorcontrib>Adewumi, Tosin</creatorcontrib><creatorcontrib>Adeyemi, 
Mofetoluwa</creatorcontrib><creatorcontrib>Aremu Anuoluwapo</creatorcontrib><creatorcontrib>Peters, Bukola</creatorcontrib><creatorcontrib>Buzaaba, Happy</creatorcontrib><creatorcontrib>Oyerinde Samuel</creatorcontrib><creatorcontrib>Rufai, Amina Mardiyyah</creatorcontrib><creatorcontrib>Ajibade, Benjamin</creatorcontrib><creatorcontrib>Gwadabe, Tajudeen</creatorcontrib><creatorcontrib>Mory Moussou Koulibaly Traore</creatorcontrib><creatorcontrib>Ajayi, Tunde</creatorcontrib><creatorcontrib>Shamsuddeen Muhammad</creatorcontrib><creatorcontrib>Baruwa, Ahmed</creatorcontrib><creatorcontrib>Owoicho, Paul</creatorcontrib><creatorcontrib>Ogunremi, Tolulope</creatorcontrib><creatorcontrib>Ngigi, Phylis</creatorcontrib><creatorcontrib>Ahia, Orevaoghene</creatorcontrib><creatorcontrib>Nasir, Ruqayya</creatorcontrib><creatorcontrib>Liwicki, Foteini</creatorcontrib><creatorcontrib>Liwicki, Marcus</creatorcontrib><collection>ProQuest SciTech Collection</collection><collection>ProQuest Technology Collection</collection><collection>Materials Science &amp; Engineering Collection</collection><collection>ProQuest Central (Alumni Edition)</collection><collection>ProQuest Central UK/Ireland</collection><collection>ProQuest Central Essentials</collection><collection>ProQuest Central</collection><collection>Technology Collection (ProQuest)</collection><collection>ProQuest One Community College</collection><collection>ProQuest Central Korea</collection><collection>SciTech Premium Collection</collection><collection>ProQuest Engineering Collection</collection><collection>Engineering Database</collection><collection>Publicly Available Content Database</collection><collection>ProQuest One Academic Eastern Edition (DO NOT USE)</collection><collection>ProQuest One Academic</collection><collection>ProQuest One Academic UKI Edition</collection><collection>ProQuest Central China</collection><collection>Engineering Collection</collection></facets><delivery><delcategory>Remote Search 
Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Adewumi, Tosin</au><au>Adeyemi, Mofetoluwa</au><au>Aremu Anuoluwapo</au><au>Peters, Bukola</au><au>Buzaaba, Happy</au><au>Oyerinde Samuel</au><au>Rufai, Amina Mardiyyah</au><au>Ajibade, Benjamin</au><au>Gwadabe, Tajudeen</au><au>Mory Moussou Koulibaly Traore</au><au>Ajayi, Tunde</au><au>Shamsuddeen Muhammad</au><au>Baruwa, Ahmed</au><au>Owoicho, Paul</au><au>Ogunremi, Tolulope</au><au>Ngigi, Phylis</au><au>Ahia, Orevaoghene</au><au>Nasir, Ruqayya</au><au>Liwicki, Foteini</au><au>Liwicki, Marcus</au><format>book</format><genre>document</genre><ristype>GEN</ristype><atitle>AfriWOZ: Corpus for Exploiting Cross-Lingual Transferability for Generation of Dialogues in Low-Resource, African Languages</atitle><jtitle>arXiv.org</jtitle><date>2022-05-19</date><risdate>2022</risdate><eissn>2331-8422</eissn><abstract>Dialogue generation is an important NLP task fraught with many challenges. The challenges become more daunting for low-resource African languages. To enable the creation of dialogue agents for African languages, we contribute the first high-quality dialogue datasets for 6 African languages: Swahili, Wolof, Hausa, Nigerian Pidgin English, Kinyarwanda &amp; Yorùbá. These datasets consist of 1,500 turns each, which we translate from a portion of the English multi-domain MultiWOZ dataset. Subsequently, we investigate &amp; analyze the effectiveness of modelling through transfer learning by utilizing state-of-the-art (SoTA) deep monolingual models: DialoGPT and BlenderBot. We compare the models with a simple seq2seq baseline using perplexity. Besides this, we conduct human evaluation of single-turn conversations by using majority votes and measure inter-annotator agreement (IAA). We find that the hypothesis that deep monolingual models learn some abstractions that generalize across languages holds. We observe human-like conversations, to different degrees, in 5 out of the 6 languages. 
The language with the most transferable properties is the Nigerian Pidgin English, with a human-likeness score of 78.1%, of which 34.4% are unanimous. We freely provide the datasets and host the model checkpoints/demos on the HuggingFace hub for public access.</abstract><cop>Ithaca</cop><pub>Cornell University Library, arXiv.org</pub><oa>free_for_read</oa></addata></record>
fulltext fulltext
identifier EISSN: 2331-8422
ispartof arXiv.org, 2022-05
issn 2331-8422
language eng
recordid cdi_proquest_journals_2652412446
source Free E- Journals
subjects African languages
Datasets
English language
Evaluation
Hypotheses
Languages
Speech recognition
title AfriWOZ: Corpus for Exploiting Cross-Lingual Transferability for Generation of Dialogues in Low-Resource, African Languages
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-02-13T15%3A11%3A15IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest&rft_val_fmt=info:ofi/fmt:kev:mtx:book&rft.genre=document&rft.atitle=AfriWOZ:%20Corpus%20for%20Exploiting%20Cross-Lingual%20Transferability%20for%20Generation%20of%20Dialogues%20in%20Low-Resource,%20African%20Languages&rft.jtitle=arXiv.org&rft.au=Adewumi,%20Tosin&rft.date=2022-05-19&rft.eissn=2331-8422&rft_id=info:doi/&rft_dat=%3Cproquest%3E2652412446%3C/proquest%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2652412446&rft_id=info:pmid/&rfr_iscdi=true