Theory, Analysis, and Best Practices for Sigmoid Self-Attention
Attention is a key part of the transformer architecture. It is a sequence-to-sequence mapping that transforms each sequence element into a weighted sum of values. The weights are typically obtained as the softmax of dot products between keys and queries. Recent work has explored alternatives to softmax attention in transformers, such as ReLU and sigmoid activations.
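The abstract describes attention as a weighted sum of values, with weights obtained from the softmax of query-key dot products, and sigmoid attention as replacing that row-wise softmax with an element-wise sigmoid. The sketch below illustrates the contrast in plain NumPy; it is not the paper's FLASHSIGMOID kernel, and the bias term `b` used to shift the sigmoid scores is an illustrative assumption, not necessarily the normalization the authors prescribe.

```python
# Minimal sketch contrasting softmax attention with an element-wise sigmoid
# variant. Not the paper's FLASHSIGMOID implementation; the bias `b` is an
# assumed knob for rescaling the sigmoid weights, shown for illustration only.
import numpy as np

def softmax_attention(Q, K, V):
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)             # (n, n) scaled dot products
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # each row sums to 1
    return weights @ V

def sigmoid_attention(Q, K, V, b=0.0):
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d) + b         # same scores, plus a bias shift
    weights = 1.0 / (1.0 + np.exp(-scores))   # squashed independently; rows
    return weights @ V                        # no longer sum to 1

rng = np.random.default_rng(0)
n, d = 8, 16                                  # sequence length, head dimension
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
print(softmax_attention(Q, K, V).shape)                 # (8, 16)
print(sigmoid_attention(Q, K, V, b=-np.log(n)).shape)   # (8, 16); bias choice is an assumption
```

Because sigmoid weights are not normalized across a row, their total mass can grow with sequence length, which is one way to read the abstract's emphasis on properly normalized sigmoid attention and on stabilizing large attention norms early in training.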
Saved in:
Main authors: | Ramapuram, Jason; Danieli, Federico; Dhekane, Eeshan; Weers, Floris; Busbridge, Dan; Ablin, Pierre; Likhomanenko, Tatiana; Digani, Jagrit; Gu, Zijin; Shidani, Amitis; Webb, Russ |
---|---|
Format: | Article |
Language: | eng |
Keywords: | Computer Science - Learning |
Online access: | Order full text |
container_end_page | |
---|---|
container_issue | |
container_start_page | |
container_title | |
container_volume | |
creator | Ramapuram, Jason; Danieli, Federico; Dhekane, Eeshan; Weers, Floris; Busbridge, Dan; Ablin, Pierre; Likhomanenko, Tatiana; Digani, Jagrit; Gu, Zijin; Shidani, Amitis; Webb, Russ |
description | Attention is a key part of the transformer architecture. It is a
sequence-to-sequence mapping that transforms each sequence element into a
weighted sum of values. The weights are typically obtained as the softmax of
dot products between keys and queries. Recent work has explored alternatives to
softmax attention in transformers, such as ReLU and sigmoid activations. In
this work, we revisit sigmoid attention and conduct an in-depth theoretical and
empirical analysis. Theoretically, we prove that transformers with sigmoid
attention are universal function approximators and benefit from improved
regularity compared to softmax attention. Through detailed empirical analysis,
we identify stabilization of large initial attention norms during the early
stages of training as a crucial factor for the successful training of models
with sigmoid attention, outperforming prior attempts. We also introduce
FLASHSIGMOID, a hardware-aware and memory-efficient implementation of sigmoid
attention yielding a 17% inference kernel speed-up over FLASHATTENTION2 on H100
GPUs. Experiments across language, vision, and speech show that properly
normalized sigmoid attention matches the strong performance of softmax
attention on a wide range of domains and scales, which previous attempts at
sigmoid attention were unable to fully achieve. Our work unifies prior art and
establishes best practices for sigmoid attention as a drop-in softmax
replacement in transformers. |
doi_str_mv | 10.48550/arxiv.2409.04431 |
format | Article |
fulltext | fulltext_linktorsrc |
identifier | DOI: 10.48550/arxiv.2409.04431 |
ispartof | |
issn | |
language | eng |
recordid | cdi_arxiv_primary_2409_04431 |
source | arXiv.org |
subjects | Computer Science - Learning |
title | Theory, Analysis, and Best Practices for Sigmoid Self-Attention |
url | https://arxiv.org/abs/2409.04431 |