Self-Distillation Mixup Training for Non-autoregressive Neural Machine Translation

Recent non-autoregressive (NAT) models predict outputs in parallel, achieving substantial improvements in generation speed compared to autoregressive (AT) models. While performing worse on raw data, most NAT models are trained as student models on distilled data generated by AT teacher models, which is known as sequence-level Knowledge Distillation. An effective training strategy for improving AT models is Self-Distillation Mixup (SDM) Training, which pre-trains a model on raw data, generates distilled data with the pre-trained model itself, and finally re-trains a model on the combination of raw and distilled data. In this work, we investigate SDM for NAT models, but find that directly adopting SDM for NAT models yields no improvement in translation quality. Through careful analysis, we observe that this failure correlates with the Modeling Diversity and Confirmation Bias between the AT teacher model and the NAT student models. Based on these findings, we propose an enhanced strategy named SDMRT that adds two stages to classic SDM: Pre-Rerank on self-distilled data and Fine-Tune on filtered teacher-distilled data. Our results outperform baselines by 0.6 to 1.2 BLEU on multiple NAT models. As another bonus, for Iterative Refinement NAT models, our methods outperform baselines with half the number of iterations, i.e., a 2X speedup.

Detailed description

Saved in:
Bibliographic details
Main authors: Guo, Jiaxin, Wang, Minghan, Wei, Daimeng, Shang, Hengchao, Wang, Yuxia, Li, Zongyao, Yu, Zhengzhe, Wu, Zhanglin, Chen, Yimeng, Su, Chang, Zhang, Min, Lei, Lizhi, Tao, Shimin, Yang, Hao
Format: Article
Language: eng
Subjects:
Online access: Order full text
creator Guo, Jiaxin
Wang, Minghan
Wei, Daimeng
Shang, Hengchao
Wang, Yuxia
Li, Zongyao
Yu, Zhengzhe
Wu, Zhanglin
Chen, Yimeng
Su, Chang
Zhang, Min
Lei, Lizhi
Tao, Shimin
Yang, Hao
description Recent non-autoregressive (NAT) models predict outputs in parallel, achieving substantial improvements in generation speed compared to autoregressive (AT) models. While performing worse on raw data, most NAT models are trained as student models on distilled data generated by AT teacher models, which is known as sequence-level Knowledge Distillation. An effective training strategy for improving AT models is Self-Distillation Mixup (SDM) Training, which pre-trains a model on raw data, generates distilled data with the pre-trained model itself, and finally re-trains a model on the combination of raw and distilled data. In this work, we investigate SDM for NAT models, but find that directly adopting SDM for NAT models yields no improvement in translation quality. Through careful analysis, we observe that this failure correlates with the Modeling Diversity and Confirmation Bias between the AT teacher model and the NAT student models. Based on these findings, we propose an enhanced strategy named SDMRT that adds two stages to classic SDM: Pre-Rerank on self-distilled data and Fine-Tune on filtered teacher-distilled data. Our results outperform baselines by 0.6 to 1.2 BLEU on multiple NAT models. As another bonus, for Iterative Refinement NAT models, our methods outperform baselines with half the number of iterations, i.e., a 2X speedup.
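To make the training recipe in the description concrete, the following is a minimal Python sketch of the SDM pipeline and the two extra SDMRT stages. It is an illustration only, not the authors' released code: the helpers train_model, translate_corpus, nbest_translate, rerank_with_teacher, filter_pairs, and fine_tune are hypothetical stand-ins for a full NMT training and decoding stack, and the choices of using the AT teacher as the reranker and a confidence-based filter are assumptions not spelled out in the abstract.

# Illustrative sketch of SDM and SDMRT as described above; every helper below
# (train_model, translate_corpus, nbest_translate, rerank_with_teacher,
# filter_pairs, fine_tune) is a hypothetical placeholder, not a real API.

def sdm_training(raw_pairs):
    """Classic SDM: pre-train, self-distill, then re-train on the mixture."""
    model = train_model(raw_pairs)                       # 1. pre-train on raw data
    sources = [src for src, _ in raw_pairs]
    self_distilled = list(zip(sources,
                              translate_corpus(model, sources)))   # 2. self-distill
    return train_model(raw_pairs + self_distilled)       # 3. re-train on raw + distilled

def sdmrt_training(raw_pairs, at_teacher):
    """SDMRT: SDM plus Pre-Rerank on self-distilled data and Fine-Tune on
    filtered teacher-distilled data (implementations assumed)."""
    nat_model = train_model(raw_pairs)
    sources = [src for src, _ in raw_pairs]

    # Pre-Rerank: rescore the NAT model's own n-best outputs (assumed here to
    # be reranked by the AT teacher) before mixing them back into training.
    reranked = [rerank_with_teacher(at_teacher, hyps)
                for hyps in nbest_translate(nat_model, sources)]
    nat_model = train_model(raw_pairs + list(zip(sources, reranked)))

    # Fine-Tune: continue training on teacher-distilled pairs, keeping only
    # those that pass a filtering step (criterion assumed, e.g. teacher confidence).
    teacher_distilled = list(zip(sources, translate_corpus(at_teacher, sources)))
    return fine_tune(nat_model, filter_pairs(teacher_distilled))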
doi_str_mv 10.48550/arxiv.2112.11640
format Article
fulltext fulltext_linktorsrc
identifier DOI: 10.48550/arxiv.2112.11640
language eng
recordid cdi_arxiv_primary_2112_11640
source arXiv.org
subjects Computer Science - Artificial Intelligence
Computer Science - Computation and Language
title Self-Distillation Mixup Training for Non-autoregressive Neural Machine Translation
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-18T23%3A26%3A03IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-arxiv_GOX&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Self-Distillation%20Mixup%20Training%20for%20Non-autoregressive%20Neural%20Machine%20Translation&rft.au=Guo,%20Jiaxin&rft.date=2021-12-21&rft_id=info:doi/10.48550/arxiv.2112.11640&rft_dat=%3Carxiv_GOX%3E2112_11640%3C/arxiv_GOX%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_id=info:pmid/&rfr_iscdi=true