Theory, Analysis, and Best Practices for Sigmoid Self-Attention
Attention is a key part of the transformer architecture. It is a sequence-to-sequence mapping that transforms each sequence element into a weighted sum of values. The weights are typically obtained as the softmax of dot products between keys and queries. Recent work has explored alternatives to softmax attention in transformers, such as ReLU and sigmoid activations.
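The abstract describes attention as a weighted sum of values, with weights obtained from the softmax of query-key dot products, and sigmoid attention as replacing that row-wise softmax with an element-wise sigmoid. The sketch below illustrates the contrast in plain NumPy; it is not the paper's FLASHSIGMOID kernel, and the bias term `b` used to shift the sigmoid scores is an illustrative assumption, not necessarily the normalization the authors prescribe.

```python
# Minimal sketch contrasting softmax attention with an element-wise sigmoid
# variant. Not the paper's FLASHSIGMOID implementation; the bias `b` is an
# assumed knob for rescaling the sigmoid weights, shown for illustration only.
import numpy as np

def softmax_attention(Q, K, V):
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)             # (n, n) scaled dot products
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # each row sums to 1
    return weights @ V

def sigmoid_attention(Q, K, V, b=0.0):
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d) + b         # same scores, plus a bias shift
    weights = 1.0 / (1.0 + np.exp(-scores))   # squashed independently; rows
    return weights @ V                        # no longer sum to 1

rng = np.random.default_rng(0)
n, d = 8, 16                                  # sequence length, head dimension
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
print(softmax_attention(Q, K, V).shape)                 # (8, 16)
print(sigmoid_attention(Q, K, V, b=-np.log(n)).shape)   # (8, 16); bias choice is an assumption
```

Because sigmoid weights are not normalized across a row, their total mass can grow with sequence length, which is one way to read the abstract's emphasis on properly normalized sigmoid attention and on stabilizing large attention norms early in training.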
Saved in:
Main authors: | Ramapuram, Jason; Danieli, Federico; Dhekane, Eeshan; Weers, Floris; Busbridge, Dan; Ablin, Pierre; Likhomanenko, Tatiana; Digani, Jagrit; Gu, Zijin; Shidani, Amitis; Webb, Russ |
---|---|
Format: | Article |
Language: | eng |
Keywords: | Computer Science - Learning |
Online access: | Order full text |
container_end_page | |
---|---|
container_issue | |
container_start_page | |
container_title | |
container_volume | |
creator | Ramapuram, Jason; Danieli, Federico; Dhekane, Eeshan; Weers, Floris; Busbridge, Dan; Ablin, Pierre; Likhomanenko, Tatiana; Digani, Jagrit; Gu, Zijin; Shidani, Amitis; Webb, Russ |
description | Attention is a key part of the transformer architecture. It is a
sequence-to-sequence mapping that transforms each sequence element into a
weighted sum of values. The weights are typically obtained as the softmax of
dot products between keys and queries. Recent work has explored alternatives to
softmax attention in transformers, such as ReLU and sigmoid activations. In
this work, we revisit sigmoid attention and conduct an in-depth theoretical and
empirical analysis. Theoretically, we prove that transformers with sigmoid
attention are universal function approximators and benefit from improved
regularity compared to softmax attention. Through detailed empirical analysis,
we identify stabilization of large initial attention norms during the early
stages of training as a crucial factor for the successful training of models
with sigmoid attention, outperforming prior attempts. We also introduce
FLASHSIGMOID, a hardware-aware and memory-efficient implementation of sigmoid
attention yielding a 17% inference kernel speed-up over FLASHATTENTION2 on H100
GPUs. Experiments across language, vision, and speech show that properly
normalized sigmoid attention matches the strong performance of softmax
attention on a wide range of domains and scales, which previous attempts at
sigmoid attention were unable to fully achieve. Our work unifies prior art and
establishes best practices for sigmoid attention as a drop-in softmax
replacement in transformers. |
doi_str_mv | 10.48550/arxiv.2409.04431 |
format | Article |
fulltext | fulltext_linktorsrc |
identifier | DOI: 10.48550/arxiv.2409.04431 |
ispartof | |
issn | |
language | eng |
recordid | cdi_arxiv_primary_2409_04431 |
source | arXiv.org |
subjects | Computer Science - Learning |
title | Theory, Analysis, and Best Practices for Sigmoid Self-Attention |
url | https://arxiv.org/abs/2409.04431 |