Learning from Multiple Noisy Augmented Data Sets for Better Cross-Lingual Spoken Language Understanding

Lack of training data presents a grand challenge to scaling out spoken language understanding (SLU) to low-resource languages. Although various data augmentation approaches have been proposed to synthesize training data in low-resource target languages, the augmented data sets are often noisy and thus impede the performance of SLU models. In this paper we focus on mitigating noise in augmented data. We develop a denoising training approach in which multiple models are trained on data produced by various augmentation methods, and those models provide supervision signals to each other. The experimental results show that our method outperforms the existing state of the art by 3.05 and 4.24 percentage points on two benchmark datasets, respectively. The code will be open-sourced on GitHub.
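The abstract's core idea — models trained on different noisy augmented data sets supervising each other — can be illustrated with a minimal sketch. This is not the paper's actual training procedure; the function name, the majority-vote rule, and the agreement threshold are all my own assumptions, standing in for whatever supervision signal the authors use:

```python
from collections import Counter

def cross_supervise(predictions, min_agree=2):
    """Denoise labels by cross-model agreement.

    predictions: one list of predicted labels per model, where each
    model was trained on a differently augmented data set; all lists
    cover the same examples in the same order.

    Returns (indices, labels): the examples whose majority label is
    supported by at least `min_agree` models, together with that
    label. Examples without sufficient agreement are dropped as noise.
    """
    n_examples = len(predictions[0])
    kept_idx, kept_labels = [], []
    for i in range(n_examples):
        # Tally the label each model assigns to example i.
        votes = Counter(model_preds[i] for model_preds in predictions)
        label, count = votes.most_common(1)[0]
        if count >= min_agree:
            kept_idx.append(i)
            kept_labels.append(label)
    return kept_idx, kept_labels
```

For example, with three models voting on four intent labels, `min_agree=3` keeps only the examples on which all models concur, while `min_agree=2` also retains those with a two-model majority. In a real pipeline, the retained (index, label) pairs would become cleaned training data for the next round.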

Detailed description

Saved in:
Bibliographic details
Main authors: Guo, Yingmei, Shou, Linjun, Pei, Jian, Gong, Ming, Xu, Mingxing, Wu, Zhiyong, Jiang, Daxin
Format: Article
Language: English
Subjects:
Online access: Order full text
date 2021-09-03
identifier DOI: 10.48550/arxiv.2109.01583
source arXiv.org
subjects Computer Science - Artificial Intelligence
Computer Science - Computation and Language