Characterizing and addressing the issue of oversmoothing in neural autoregressive sequence modeling

Neural autoregressive sequence models smear probability mass across many possible sequences, including degenerate ones such as empty or repetitive sequences. In this work, we tackle one specific case in which the model assigns a high probability to unreasonably short sequences. We define the oversmoothing rate to quantify this issue. After confirming the high degree of oversmoothing in neural machine translation, we propose to explicitly minimize the oversmoothing rate during training. We conduct a set of experiments to study the effect of the proposed regularization on both the model distribution and decoding performance. We use neural machine translation as the testbed and consider three datasets of varying size. Our experiments reveal three major findings. First, we can control the oversmoothing rate of the model by tuning the strength of the regularization. Second, as the contribution of the oversmoothing loss increases, the probability and the rank of the end-of-sequence token decrease sharply at positions where the sequence is not supposed to end. Third, the proposed regularization affects the outcome of beam search, especially when a large beam is used: the degradation of translation quality (measured in BLEU) with a large beam lessens significantly at lower oversmoothing rates, although some degradation relative to smaller beams persists. From these observations, we conclude that the high degree of oversmoothing is the main reason behind the degenerate case of overly probable short sequences in a neural autoregressive model.
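The record gives only the prose summary above and does not include the paper's formal definitions. As a rough, non-authoritative sketch, the snippet below illustrates one plausible reading of the two quantities the abstract mentions, assuming a teacher-forced decoder that exposes per-position log-probabilities: an oversmoothing rate that counts how often a premature end-of-sequence token is at least as probable as the remaining ground-truth suffix, and a hinge-style surrogate loss whose weight plays the role of the "strength of the regularization". The function names, tensor shapes, and the margin value are illustrative assumptions, not the authors' implementation.

```python
import torch

def oversmoothing_rate(logp, target, eos_id):
    # logp:   (T, V) log-probabilities at each position, conditioned on the
    #         ground-truth prefix (teacher forcing)
    # target: (T,) ground-truth token ids, with <eos> as the final token
    T = target.shape[0]
    tok_logp = logp[torch.arange(T), target]            # log p(y_t | y_<t)
    # log p(y_t, ..., y_T | y_<t): reverse cumulative sum of token log-probs
    suffix_logp = torch.flip(torch.cumsum(torch.flip(tok_logp, [0]), 0), [0])
    eos_logp = logp[:, eos_id]                          # log p(<eos> | y_<t)
    # fraction of non-final positions where stopping is at least as likely
    # as continuing with the correct suffix
    return (eos_logp[:-1] >= suffix_logp[:-1]).float().mean()

def oversmoothing_loss(logp, target, eos_id, margin=1e-4):
    # Hinge-style surrogate: penalize positions where a premature <eos> is
    # within `margin` of, or above, the correct suffix's log-probability.
    T = target.shape[0]
    tok_logp = logp[torch.arange(T), target]
    suffix_logp = torch.flip(torch.cumsum(torch.flip(tok_logp, [0]), 0), [0])
    eos_logp = logp[:, eos_id]
    return torch.clamp(margin + eos_logp[:-1] - suffix_logp[:-1], min=0).mean()
```

Under this reading, training would minimize something like nll + alpha * oversmoothing_loss(logp, target, eos_id), with alpha as the tunable regularization strength discussed in the abstract; the beam-search behaviour reported in the findings is a downstream effect and is not modelled here.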


Bibliographic details
Main authors: Kulikov, Ilia; Eremeev, Maksim; Cho, Kyunghyun
Format: Article
Language: English
Subjects: Computer Science - Computation and Language; Computer Science - Learning
Online access: Order full text
creator Kulikov, Ilia
Eremeev, Maksim
Cho, Kyunghyun
description Neural autoregressive sequence models smear probability mass across many possible sequences, including degenerate ones such as empty or repetitive sequences. In this work, we tackle one specific case in which the model assigns a high probability to unreasonably short sequences. We define the oversmoothing rate to quantify this issue. After confirming the high degree of oversmoothing in neural machine translation, we propose to explicitly minimize the oversmoothing rate during training. We conduct a set of experiments to study the effect of the proposed regularization on both the model distribution and decoding performance. We use neural machine translation as the testbed and consider three datasets of varying size. Our experiments reveal three major findings. First, we can control the oversmoothing rate of the model by tuning the strength of the regularization. Second, as the contribution of the oversmoothing loss increases, the probability and the rank of the end-of-sequence token decrease sharply at positions where the sequence is not supposed to end. Third, the proposed regularization affects the outcome of beam search, especially when a large beam is used: the degradation of translation quality (measured in BLEU) with a large beam lessens significantly at lower oversmoothing rates, although some degradation relative to smaller beams persists. From these observations, we conclude that the high degree of oversmoothing is the main reason behind the degenerate case of overly probable short sequences in a neural autoregressive model.
doi_str_mv 10.48550/arxiv.2112.08914
format Article
creationdate 2021-12-16
rights http://creativecommons.org/licenses/by/4.0
fulltext fulltext_linktorsrc
identifier DOI: 10.48550/arxiv.2112.08914
language eng
recordid cdi_arxiv_primary_2112_08914
source arXiv.org
subjects Computer Science - Computation and Language
Computer Science - Learning
title Characterizing and addressing the issue of oversmoothing in neural autoregressive sequence modeling
url https://arxiv.org/abs/2112.08914