Eureka-Moments in Transformers: Multi-Step Tasks Reveal Softmax Induced Optimization Problems
Saved in:
Published in: | arXiv.org 2023-10 |
---|---|
Main authors: | Hoffmann, David T; Schrodi, Simon; Behrmann, Nadine; Fischer, Volker; Brox, Thomas |
Format: | Article |
Language: | eng |
Subjects: | Optimization; Parameter robustness; Training; Transformers |
Online access: | Full text |
container_title | arXiv.org |
---|---|
creator | Hoffmann, David T; Schrodi, Simon; Behrmann, Nadine; Fischer, Volker; Brox, Thomas |
description | In this work, we study rapid, step-wise improvements of the loss in transformers when confronted with multi-step decision tasks. We find that transformers struggle to learn the intermediate tasks, whereas CNNs have no such issue on the tasks we studied. When transformers learn the intermediate task, they do so rapidly and unexpectedly, after both the training and validation loss have saturated for hundreds of epochs. We call these rapid improvements Eureka-moments, since the transformer appears to suddenly learn a previously incomprehensible task. Similar leaps in performance have become known as Grokking. In contrast to Grokking, for Eureka-moments both the training and the validation loss saturate before rapidly improving. We trace the problem back to the Softmax function in the self-attention block of transformers and show ways to alleviate the problem. These fixes improve training speed: the improved models reach 95% of the baseline model's accuracy in just 20% of the training steps, are much more likely to learn the intermediate task, reach a higher final accuracy, and are more robust to hyper-parameters. (See the illustrative softmax-attention sketch after the record fields below.) |
format | Article |
fulltext | fulltext |
identifier | EISSN: 2331-8422 |
ispartof | arXiv.org, 2023-10 |
issn | 2331-8422 |
language | eng |
recordid | cdi_proquest_journals_2879442114 |
source | Free E-Journals |
subjects | Optimization; Parameter robustness; Training; Transformers |
title | Eureka-Moments in Transformers: Multi-Step Tasks Reveal Softmax Induced Optimization Problems |
url | https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2024-12-26T22%3A39%3A58IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest&rft_val_fmt=info:ofi/fmt:kev:mtx:book&rft.genre=document&rft.atitle=Eureka-Moments%20in%20Transformers:%20Multi-Step%20Tasks%20Reveal%20Softmax%20Induced%20Optimization%20Problems&rft.jtitle=arXiv.org&rft.au=Hoffmann,%20David%20T&rft.date=2023-10-19&rft.eissn=2331-8422&rft_id=info:doi/&rft_dat=%3Cproquest%3E2879442114%3C/proquest%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2879442114&rft_id=info:pmid/&rfr_iscdi=true |
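The abstract traces the optimization plateau to the Softmax function in the self-attention block. As a rough, self-contained illustration of that mechanism (not the paper's experiments or its proposed fixes), the NumPy sketch below implements single-head scaled dot-product attention and reports how the entropy of the attention weights varies with an illustrative temperature parameter `tau`. All names in the sketch (`self_attention`, `attention_entropy`, `tau`, the random weights) are assumptions made for this example.

```python
# Minimal sketch of single-head scaled dot-product self-attention (NumPy).
# Illustrative only: `tau` is a generic temperature knob used to show how
# softmax sharpness changes attention entropy; it is NOT the paper's fix.
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv, tau=1.0):
    """Single-head scaled dot-product self-attention. X has shape (seq_len, d_model)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = Q.shape[-1]
    scores = (Q @ K.T) / (np.sqrt(d_k) * tau)  # attention logits fed to the softmax
    A = softmax(scores, axis=-1)               # attention weights, each row sums to 1
    return A @ V, A

def attention_entropy(A):
    """Mean entropy of the attention rows: high = near-uniform, low = sharply peaked."""
    return float(-(A * np.log(A + 1e-12)).sum(axis=-1).mean())

rng = np.random.default_rng(0)
seq_len, d_model = 8, 16
X = rng.normal(size=(seq_len, d_model))
Wq, Wk, Wv = (rng.normal(scale=d_model ** -0.5, size=(d_model, d_model)) for _ in range(3))

# Larger tau flattens the softmax (attention stays near-uniform); smaller tau sharpens it.
for tau in (4.0, 1.0, 0.25):
    _, A = self_attention(X, Wq, Wk, Wv, tau=tau)
    print(f"tau={tau:<5} mean attention entropy = {attention_entropy(A):.3f}")
```

Flat, high-entropy attention is one intuition for how an intermediate task can remain effectively invisible to the model for many epochs, consistent with the plateau described in the abstract; the concrete remedies and analysis are given in the paper itself, not in this sketch.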