Setting the Record Straight on Transformer Oversmoothing

Transformer-based models have recently become wildly successful across a diverse set of domains. At the same time, recent work has shown empirically and theoretically that Transformers are inherently limited. Specifically, they argue that as model depth increases, Transformers oversmooth, i.e., inputs become more and more similar.

Bibliographic Details

Main authors: Dovonon, Gbètondji J-S, Bronstein, Michael M, Kusner, Matt J
Format: Article
Language: eng
creator Dovonon, Gbètondji J-S ; Bronstein, Michael M ; Kusner, Matt J
description Transformer-based models have recently become wildly successful across a diverse set of domains. At the same time, recent work has shown empirically and theoretically that Transformers are inherently limited. Specifically, they argue that as model depth increases, Transformers oversmooth, i.e., inputs become more and more similar. A natural question is: How can Transformers achieve these successes given this shortcoming? In this work we test these observations empirically and theoretically and uncover a number of surprising findings. We find that there are cases where feature similarity increases but, contrary to prior results, this is not inevitable, even for existing pre-trained models. Theoretically, we show that smoothing behavior depends on the eigenspectrum of the value and projection weights. We verify this empirically and observe that the sign of layer normalization weights can influence this effect. Our analysis reveals a simple way to parameterize the weights of the Transformer update equations to influence smoothing behavior. We hope that our findings give ML researchers and practitioners additional insight into how to develop future Transformer-based models.
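Note: the abstract points to two measurable quantities: how similar token features become as depth increases, and the eigenspectrum of the value and projection weights, which the authors argue governs smoothing behavior. The following is a minimal sketch (not the authors' code) that tracks both on a toy self-attention stack; the simplified layer update, the dimensions, and the similarity metric are illustrative assumptions, not the paper's exact setup.

import torch

torch.manual_seed(0)
d, n, depth = 64, 16, 12  # feature dim, token count, layer count (illustrative)

def attention_layer(x, w_v, w_o):
    # One simplified self-attention update: residual + softmax(x x^T / sqrt(d)) x W_v W_o.
    attn = torch.softmax(x @ x.T / d ** 0.5, dim=-1)
    return x + attn @ x @ w_v @ w_o

def mean_pairwise_cosine(x):
    # Average cosine similarity between distinct token pairs; 1.0 means fully smoothed.
    xn = torch.nn.functional.normalize(x, dim=-1)
    sim = xn @ xn.T
    return sim[~torch.eye(n, dtype=torch.bool)].mean().item()

x = torch.randn(n, d)
layers = [(torch.randn(d, d) / d ** 0.5, torch.randn(d, d) / d ** 0.5) for _ in range(depth)]

for i, (w_v, w_o) in enumerate(layers):
    x = attention_layer(x, w_v, w_o)
    # The abstract ties smoothing behavior to the eigenspectrum of the value/projection weights.
    max_eig = torch.linalg.eigvals(w_v @ w_o).abs().max().item()
    print(f"layer {i:2d}  mean token similarity {mean_pairwise_cosine(x):.3f}  max |eig(W_v W_o)| {max_eig:.3f}")

Running the sketch with different random scalings of w_v and w_o shows how the token similarity trajectory changes with the weight spectrum, which is the kind of dependence the abstract describes.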
doi_str_mv 10.48550/arxiv.2401.04301
format Article
fulltext fulltext_linktorsrc
identifier DOI: 10.48550/arxiv.2401.04301
language eng
recordid cdi_arxiv_primary_2401_04301
source arXiv.org
subjects Computer Science - Learning
title Setting the Record Straight on Transformer Oversmoothing
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2024-12-27T01%3A43%3A15IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-arxiv_GOX&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Setting%20the%20Record%20Straight%20on%20Transformer%20Oversmoothing&rft.au=Dovonon,%20Gb%C3%A8tondji%20J-S&rft.date=2024-01-08&rft_id=info:doi/10.48550/arxiv.2401.04301&rft_dat=%3Carxiv_GOX%3E2401_04301%3C/arxiv_GOX%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_id=info:pmid/&rfr_iscdi=true