AfriWOZ: Corpus for Exploiting Cross-Lingual Transferability for Generation of Dialogues in Low-Resource, African Languages
Dialogue generation is an important NLP task fraught with many challenges. The challenges become more daunting for low-resource African languages. To enable the creation of dialogue agents for African languages, we contribute the first high-quality dialogue datasets for 6 African languages: Swahili,...
container_end_page | |
---|---|
container_issue | |
container_start_page | |
container_title | |
container_volume | |
creator | Adewumi, Tosin; Adeyemi, Mofetoluwa; Anuoluwapo, Aremu; Peters, Bukola; Buzaaba, Happy; Samuel, Oyerinde; Rufai, Amina Mardiyyah; Ajibade, Benjamin; Gwadabe, Tajudeen; Traore, Mory Moussou Koulibaly; Ajayi, Tunde; Muhammad, Shamsuddeen; Baruwa, Ahmed; Owoicho, Paul; Ogunremi, Tolulope; Ngigi, Phylis; Ahia, Orevaoghene; Nasir, Ruqayya; Liwicki, Foteini; Liwicki, Marcus |
description | Dialogue generation is an important NLP task fraught with many challenges.
The challenges become more daunting for low-resource African languages. To
enable the creation of dialogue agents for African languages, we contribute the
first high-quality dialogue datasets for 6 African languages: Swahili, Wolof,
Hausa, Nigerian Pidgin English, Kinyarwanda & Yorùbá. These datasets
consist of 1,500 turns each, which we translate from a portion of the English
multi-domain MultiWOZ dataset. Subsequently, we investigate & analyze the
effectiveness of modelling through transfer learning by utilizing
state-of-the-art (SoTA) deep monolingual models: DialoGPT and BlenderBot. We
compare the models with a simple seq2seq baseline using perplexity. Besides
this, we conduct human evaluation of single-turn conversations by using
majority votes and measure inter-annotator agreement (IAA). We find that the
hypothesis that deep monolingual models learn some abstractions that generalize
across languages holds. We observe human-like conversations, to different
degrees, in 5 out of the 6 languages. The language with the most transferable
properties is the Nigerian Pidgin English, with a human-likeness score of
78.1%, of which 34.4% are unanimous. We freely provide the datasets and host
the model checkpoints/demos on the HuggingFace hub for public access. |
doi_str_mv | 10.48550/arxiv.2204.08083 |
format | Article |
fullrecord | <record><control><sourceid>arxiv_GOX</sourceid><recordid>TN_cdi_arxiv_primary_2204_08083</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><sourcerecordid>2204_08083</sourcerecordid><originalsourceid>FETCH-LOGICAL-a673-e7f72de8aadf094309098064bd9746fd7fe7a175c3690e299ba1fbc64d1607ae3</originalsourceid><addsrcrecordid>eNotUMFqhDAUzKWHsu0H9NR8QLXRRGN6W-x2uyAsFKHQizz1RQLWSKLtLv35rranGYZhhhlC7iIWiixJ2CO4k_kK45iJkGUs49fkZ6udeT9-PNHcunH2VFtHd6ext2YyQ0dzZ70Pigudoaelg8FrdFCb3kzn1bzH4SJMxg7UavpsoLfdjJ6agRb2O3hDb2fX4ANdmhq4qLCEdehvyJWG3uPtP25I-bIr89egOO4P-bYIIJU8QKll3GIG0GqmBGeKqYylom6VFKlupUYJkUwaniqGsVI1RLpuUtFGKZOAfEPu_2LX9dXozCe4c7W8UK0v8F8fHFlz</addsrcrecordid><sourcetype>Open Access Repository</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype></control><display><type>article</type><title>AfriWOZ: Corpus for Exploiting Cross-Lingual Transferability for Generation of Dialogues in Low-Resource, African Languages</title><source>arXiv.org</source><creator>Adewumi, Tosin ; Adeyemi, Mofetoluwa ; Anuoluwapo, Aremu ; Peters, Bukola ; Buzaaba, Happy ; Samuel, Oyerinde ; Rufai, Amina Mardiyyah ; Ajibade, Benjamin ; Gwadabe, Tajudeen ; Traore, Mory Moussou Koulibaly ; Ajayi, Tunde ; Muhammad, Shamsuddeen ; Baruwa, Ahmed ; Owoicho, Paul ; Ogunremi, Tolulope ; Ngigi, Phylis ; Ahia, Orevaoghene ; Nasir, Ruqayya ; Liwicki, Foteini ; Liwicki, Marcus</creator><creatorcontrib>Adewumi, Tosin ; Adeyemi, Mofetoluwa ; Anuoluwapo, Aremu ; Peters, Bukola ; Buzaaba, Happy ; Samuel, Oyerinde ; Rufai, Amina Mardiyyah ; Ajibade, Benjamin ; Gwadabe, Tajudeen ; Traore, Mory Moussou Koulibaly ; Ajayi, Tunde ; Muhammad, Shamsuddeen ; Baruwa, Ahmed ; Owoicho, Paul ; Ogunremi, Tolulope ; Ngigi, Phylis ; Ahia, Orevaoghene ; Nasir, Ruqayya ; Liwicki, Foteini ; Liwicki, Marcus</creatorcontrib><description>Dialogue generation is an important NLP task fraught with many challenges.
The challenges become more daunting for low-resource African languages. To
enable the creation of dialogue agents for African languages, we contribute the
first high-quality dialogue datasets for 6 African languages: Swahili, Wolof,
Hausa, Nigerian Pidgin English, Kinyarwanda & Yorùbá. These datasets
consist of 1,500 turns each, which we translate from a portion of the English
multi-domain MultiWOZ dataset. Subsequently, we investigate & analyze the
effectiveness of modelling through transfer learning by utilizing
state-of-the-art (SoTA) deep monolingual models: DialoGPT and BlenderBot. We
compare the models with a simple seq2seq baseline using perplexity. Besides
this, we conduct human evaluation of single-turn conversations by using
majority votes and measure inter-annotator agreement (IAA). We find that the
hypothesis that deep monolingual models learn some abstractions that generalize
across languages holds. We observe human-like conversations, to different
degrees, in 5 out of the 6 languages. The language with the most transferable
properties is the Nigerian Pidgin English, with a human-likeness score of
78.1%, of which 34.4% are unanimous. We freely provide the datasets and host
the model checkpoints/demos on the HuggingFace hub for public access.</description><identifier>DOI: 10.48550/arxiv.2204.08083</identifier><language>eng</language><subject>Computer Science - Computation and Language</subject><creationdate>2022-04</creationdate><rights>http://creativecommons.org/licenses/by/4.0</rights><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><link.rule.ids>228,230,776,881</link.rule.ids><linktorsrc>$$Uhttps://arxiv.org/abs/2204.08083$$EView_record_in_Cornell_University$$FView_record_in_$$GCornell_University$$Hfree_for_read</linktorsrc><backlink>$$Uhttps://doi.org/10.48550/arXiv.2204.08083$$DView paper in arXiv$$Hfree_for_read</backlink></links><search><creatorcontrib>Adewumi, Tosin</creatorcontrib><creatorcontrib>Adeyemi, Mofetoluwa</creatorcontrib><creatorcontrib>Anuoluwapo, Aremu</creatorcontrib><creatorcontrib>Peters, Bukola</creatorcontrib><creatorcontrib>Buzaaba, Happy</creatorcontrib><creatorcontrib>Samuel, Oyerinde</creatorcontrib><creatorcontrib>Rufai, Amina Mardiyyah</creatorcontrib><creatorcontrib>Ajibade, Benjamin</creatorcontrib><creatorcontrib>Gwadabe, Tajudeen</creatorcontrib><creatorcontrib>Traore, Mory Moussou Koulibaly</creatorcontrib><creatorcontrib>Ajayi, Tunde</creatorcontrib><creatorcontrib>Muhammad, Shamsuddeen</creatorcontrib><creatorcontrib>Baruwa, Ahmed</creatorcontrib><creatorcontrib>Owoicho, Paul</creatorcontrib><creatorcontrib>Ogunremi, Tolulope</creatorcontrib><creatorcontrib>Ngigi, Phylis</creatorcontrib><creatorcontrib>Ahia, Orevaoghene</creatorcontrib><creatorcontrib>Nasir, Ruqayya</creatorcontrib><creatorcontrib>Liwicki, Foteini</creatorcontrib><creatorcontrib>Liwicki, Marcus</creatorcontrib><title>AfriWOZ: Corpus for Exploiting Cross-Lingual Transferability for Generation of Dialogues in Low-Resource, African 
Languages</title><description>Dialogue generation is an important NLP task fraught with many challenges.
The challenges become more daunting for low-resource African languages. To
enable the creation of dialogue agents for African languages, we contribute the
first high-quality dialogue datasets for 6 African languages: Swahili, Wolof,
Hausa, Nigerian Pidgin English, Kinyarwanda & Yorùbá. These datasets
consist of 1,500 turns each, which we translate from a portion of the English
multi-domain MultiWOZ dataset. Subsequently, we investigate & analyze the
effectiveness of modelling through transfer learning by utilizing
state-of-the-art (SoTA) deep monolingual models: DialoGPT and BlenderBot. We
compare the models with a simple seq2seq baseline using perplexity. Besides
this, we conduct human evaluation of single-turn conversations by using
majority votes and measure inter-annotator agreement (IAA). We find that the
hypothesis that deep monolingual models learn some abstractions that generalize
across languages holds. We observe human-like conversations, to different
degrees, in 5 out of the 6 languages. The language with the most transferable
properties is the Nigerian Pidgin English, with a human-likeness score of
78.1%, of which 34.4% are unanimous. We freely provide the datasets and host
the model checkpoints/demos on the HuggingFace hub for public access.</description><subject>Computer Science - Computation and Language</subject><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2022</creationdate><recordtype>article</recordtype><sourceid>GOX</sourceid><recordid>eNotUMFqhDAUzKWHsu0H9NR8QLXRRGN6W-x2uyAsFKHQizz1RQLWSKLtLv35rranGYZhhhlC7iIWiixJ2CO4k_kK45iJkGUs49fkZ6udeT9-PNHcunH2VFtHd6ext2YyQ0dzZ70Pigudoaelg8FrdFCb3kzn1bzH4SJMxg7UavpsoLfdjJ6agRb2O3hDb2fX4ANdmhq4qLCEdehvyJWG3uPtP25I-bIr89egOO4P-bYIIJU8QKll3GIG0GqmBGeKqYylom6VFKlupUYJkUwaniqGsVI1RLpuUtFGKZOAfEPu_2LX9dXozCe4c7W8UK0v8F8fHFlz</recordid><startdate>20220417</startdate><enddate>20220417</enddate><creator>Adewumi, Tosin</creator><creator>Adeyemi, Mofetoluwa</creator><creator>Anuoluwapo, Aremu</creator><creator>Peters, Bukola</creator><creator>Buzaaba, Happy</creator><creator>Samuel, Oyerinde</creator><creator>Rufai, Amina Mardiyyah</creator><creator>Ajibade, Benjamin</creator><creator>Gwadabe, Tajudeen</creator><creator>Traore, Mory Moussou Koulibaly</creator><creator>Ajayi, Tunde</creator><creator>Muhammad, Shamsuddeen</creator><creator>Baruwa, Ahmed</creator><creator>Owoicho, Paul</creator><creator>Ogunremi, Tolulope</creator><creator>Ngigi, Phylis</creator><creator>Ahia, Orevaoghene</creator><creator>Nasir, Ruqayya</creator><creator>Liwicki, Foteini</creator><creator>Liwicki, Marcus</creator><scope>AKY</scope><scope>GOX</scope></search><sort><creationdate>20220417</creationdate><title>AfriWOZ: Corpus for Exploiting Cross-Lingual Transferability for Generation of Dialogues in Low-Resource, African Languages</title><author>Adewumi, Tosin ; Adeyemi, Mofetoluwa ; Anuoluwapo, Aremu ; Peters, Bukola ; Buzaaba, Happy ; Samuel, Oyerinde ; Rufai, Amina Mardiyyah ; Ajibade, Benjamin ; Gwadabe, Tajudeen ; Traore, Mory Moussou Koulibaly ; Ajayi, Tunde ; Muhammad, Shamsuddeen ; Baruwa, Ahmed ; Owoicho, Paul ; Ogunremi, Tolulope ; Ngigi, Phylis ; Ahia, Orevaoghene ; Nasir, Ruqayya ; 
Liwicki, Foteini ; Liwicki, Marcus</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-a673-e7f72de8aadf094309098064bd9746fd7fe7a175c3690e299ba1fbc64d1607ae3</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2022</creationdate><topic>Computer Science - Computation and Language</topic><toplevel>online_resources</toplevel><creatorcontrib>Adewumi, Tosin</creatorcontrib><creatorcontrib>Adeyemi, Mofetoluwa</creatorcontrib><creatorcontrib>Anuoluwapo, Aremu</creatorcontrib><creatorcontrib>Peters, Bukola</creatorcontrib><creatorcontrib>Buzaaba, Happy</creatorcontrib><creatorcontrib>Samuel, Oyerinde</creatorcontrib><creatorcontrib>Rufai, Amina Mardiyyah</creatorcontrib><creatorcontrib>Ajibade, Benjamin</creatorcontrib><creatorcontrib>Gwadabe, Tajudeen</creatorcontrib><creatorcontrib>Traore, Mory Moussou Koulibaly</creatorcontrib><creatorcontrib>Ajayi, Tunde</creatorcontrib><creatorcontrib>Muhammad, Shamsuddeen</creatorcontrib><creatorcontrib>Baruwa, Ahmed</creatorcontrib><creatorcontrib>Owoicho, Paul</creatorcontrib><creatorcontrib>Ogunremi, Tolulope</creatorcontrib><creatorcontrib>Ngigi, Phylis</creatorcontrib><creatorcontrib>Ahia, Orevaoghene</creatorcontrib><creatorcontrib>Nasir, Ruqayya</creatorcontrib><creatorcontrib>Liwicki, Foteini</creatorcontrib><creatorcontrib>Liwicki, Marcus</creatorcontrib><collection>arXiv Computer Science</collection><collection>arXiv.org</collection></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext_linktorsrc</fulltext></delivery><addata><au>Adewumi, Tosin</au><au>Adeyemi, Mofetoluwa</au><au>Anuoluwapo, Aremu</au><au>Peters, Bukola</au><au>Buzaaba, Happy</au><au>Samuel, Oyerinde</au><au>Rufai, Amina Mardiyyah</au><au>Ajibade, Benjamin</au><au>Gwadabe, Tajudeen</au><au>Traore, Mory Moussou Koulibaly</au><au>Ajayi, Tunde</au><au>Muhammad, Shamsuddeen</au><au>Baruwa, Ahmed</au><au>Owoicho, Paul</au><au>Ogunremi, 
Tolulope</au><au>Ngigi, Phylis</au><au>Ahia, Orevaoghene</au><au>Nasir, Ruqayya</au><au>Liwicki, Foteini</au><au>Liwicki, Marcus</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>AfriWOZ: Corpus for Exploiting Cross-Lingual Transferability for Generation of Dialogues in Low-Resource, African Languages</atitle><date>2022-04-17</date><risdate>2022</risdate><abstract>Dialogue generation is an important NLP task fraught with many challenges.
The challenges become more daunting for low-resource African languages. To
enable the creation of dialogue agents for African languages, we contribute the
first high-quality dialogue datasets for 6 African languages: Swahili, Wolof,
Hausa, Nigerian Pidgin English, Kinyarwanda & Yorùbá. These datasets
consist of 1,500 turns each, which we translate from a portion of the English
multi-domain MultiWOZ dataset. Subsequently, we investigate & analyze the
effectiveness of modelling through transfer learning by utilizing
state-of-the-art (SoTA) deep monolingual models: DialoGPT and BlenderBot. We
compare the models with a simple seq2seq baseline using perplexity. Besides
this, we conduct human evaluation of single-turn conversations by using
majority votes and measure inter-annotator agreement (IAA). We find that the
hypothesis that deep monolingual models learn some abstractions that generalize
across languages holds. We observe human-like conversations, to different
degrees, in 5 out of the 6 languages. The language with the most transferable
properties is the Nigerian Pidgin English, with a human-likeness score of
78.1%, of which 34.4% are unanimous. We freely provide the datasets and host
the model checkpoints/demos on the HuggingFace hub for public access.</abstract><doi>10.48550/arxiv.2204.08083</doi><oa>free_for_read</oa></addata></record> |
fulltext | fulltext_linktorsrc |
identifier | DOI: 10.48550/arxiv.2204.08083 |
ispartof | |
issn | |
language | eng |
recordid | cdi_arxiv_primary_2204_08083 |
source | arXiv.org |
subjects | Computer Science - Computation and Language |
title | AfriWOZ: Corpus for Exploiting Cross-Lingual Transferability for Generation of Dialogues in Low-Resource, African Languages |
url | https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-02-03T12%3A56%3A42IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-arxiv_GOX&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=AfriWOZ:%20Corpus%20for%20Exploiting%20Cross-Lingual%20Transferability%20for%20Generation%20of%20Dialogues%20in%20Low-Resource,%20African%20Languages&rft.au=Adewumi,%20Tosin&rft.date=2022-04-17&rft_id=info:doi/10.48550/arxiv.2204.08083&rft_dat=%3Carxiv_GOX%3E2204_08083%3C/arxiv_GOX%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_id=info:pmid/&rfr_iscdi=true |
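
The description above reports a human-likeness score obtained by majority vote, together with the share of unanimous votes. The sketch below shows one way such figures could be computed. It is illustrative only: the function name `human_likeness`, the use of three annotators per response, the boolean labels, and the choice to report the unanimous share as a fraction of all responses are assumptions, not details taken from the record, and the votes are toy values rather than data from the paper.

```python
def human_likeness(votes_per_response):
    """Return (majority_share, unanimous_share) as fractions of all responses.

    votes_per_response: list of lists of booleans, one inner list per model
    response; True means an annotator judged the response human-like.
    Assumption: an odd number of annotators per response, so no ties occur.
    """
    total = len(votes_per_response)
    if total == 0:
        raise ValueError("no responses to score")

    majority = 0   # responses where more than half the annotators said True
    unanimous = 0  # responses where every annotator said True
    for votes in votes_per_response:
        yes = sum(votes)
        if yes * 2 > len(votes):
            majority += 1
            if yes == len(votes):
                unanimous += 1
    return majority / total, unanimous / total


# Hypothetical toy votes (three annotators, four responses), not paper data.
toy_votes = [
    [True, True, True],
    [True, True, False],
    [False, False, True],
    [True, False, True],
]
majority_share, unanimous_share = human_likeness(toy_votes)
print(f"{majority_share:.1%} human-like, {unanimous_share:.1%} unanimous")
```

With these toy votes, three of four responses win a majority and one is unanimous. Whether the paper counts unanimous votes against all responses or only against the majority-positive ones is not stated in this record, so the second return value should be read with that caveat.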