On the Distribution, Sparsity, and Inference-time Quantization of Attention Values in Transformers

How much information do NLP tasks really need from a transformer's attention mechanism at application-time (inference)? From recent work, we know that there is sparsity in transformers and that the floating-points within its computation can be discretized to fewer values with minimal loss to task accuracies...

Full Description

Saved in:
Bibliographic Details
Published in: arXiv.org 2021-06
Main Authors: Ji, Tianchu; Jain, Shraddhan; Ferdman, Michael; Milder, Peter; Schwartz, H Andrew; Balasubramanian, Niranjan
Format: Article
Language: eng
Subjects: Accuracy; Data mining; Inference; Measurement; Pruning; Questions; Sentiment analysis; Sparsity; Transformers
Online Access: Full text
container_end_page
container_issue
container_start_page
container_title arXiv.org
container_volume
creator Ji, Tianchu
Jain, Shraddhan
Ferdman, Michael
Milder, Peter
Schwartz, H Andrew
Balasubramanian, Niranjan
description How much information do NLP tasks really need from a transformer's attention mechanism at application-time (inference)? From recent work, we know that there is sparsity in transformers and that the floating-points within its computation can be discretized to fewer values with minimal loss to task accuracies. However, this requires retraining or even creating entirely new models, both of which can be expensive and carbon-emitting. Focused on optimizations that do not require training, we systematically study the full range of typical attention values necessary. This informs the design of an inference-time quantization technique using both pruning and log-scaled mapping which produces only a few (e.g. \(2^3\)) unique values. Over the tasks of question answering and sentiment analysis, we find nearly 80% of attention values can be pruned to zeros with minimal (\(< 1.0\%\)) relative loss in accuracy. We use this pruning technique in conjunction with quantizing the attention values to only a 3-bit format, without retraining, resulting in only a 0.8% accuracy reduction on question answering with fine-tuned RoBERTa.
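The abstract describes a two-step inference-time scheme: prune small post-softmax attention probabilities to zero, then map the survivors onto only a few (e.g. \(2^3\)) log-scaled values. The snippet below is a minimal NumPy sketch of that idea, not the authors' implementation; the pruning threshold, the placement of the log-spaced levels, and the function name are illustrative assumptions.

```python
import numpy as np

def prune_and_log_quantize(attn, prune_threshold=1e-3, n_bits=3):
    """Sketch of inference-time attention compression:
    zero out small attention probabilities, then snap the
    survivors onto a handful of log-spaced levels.

    `attn` holds post-softmax attention probabilities in [0, 1].
    The threshold and the level placement are illustrative choices,
    not the exact scheme from the paper.
    """
    attn = np.asarray(attn, dtype=np.float64)

    # 1) Pruning: values below the threshold are treated as zero.
    kept = np.where(attn >= prune_threshold, attn, 0.0)

    # 2) Log-scaled quantization of the surviving values.
    #    With n_bits = 3 there are 2**3 = 8 representable codes;
    #    one code is reserved for zero, so 7 nonzero levels are
    #    spaced evenly in log space between the threshold and 1.0.
    n_levels = 2 ** n_bits - 1
    levels = np.logspace(np.log10(prune_threshold), 0.0, n_levels)

    out = np.zeros_like(kept)
    nz = kept > 0
    if np.any(nz):
        vals = kept[nz]  # surviving values, flattened
        # Snap each surviving value to the nearest log-spaced level.
        idx = np.argmin(np.abs(np.log10(vals)[:, None] - np.log10(levels)[None, :]), axis=1)
        out[nz] = levels[idx]
    return out

# Example: one attention row over 8 tokens (values roughly sum to 1).
row = np.array([0.62, 0.20, 0.10, 0.05, 0.02, 0.007, 0.002, 0.0008])
q = prune_and_log_quantize(row)
print(q)
print("fraction pruned to zero:", np.mean(q == 0.0))
```

Applied per attention row at inference time, a scheme of this shape needs no retraining; the abstract reports that nearly 80% of attention values can be pruned to zero with under 1% relative accuracy loss, and that combining pruning with a 3-bit quantization costs only about 0.8% accuracy on question answering with fine-tuned RoBERTa.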
format Article
fulltext fulltext
identifier EISSN: 2331-8422
ispartof arXiv.org, 2021-06
issn 2331-8422
language eng
recordid cdi_proquest_journals_2536667393
source Free E-Journals
subjects Accuracy
Data mining
Inference
Measurement
Pruning
Questions
Sentiment analysis
Sparsity
Transformers
title On the Distribution, Sparsity, and Inference-time Quantization of Attention Values in Transformers
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-02-03T16%3A42%3A01IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest&rft_val_fmt=info:ofi/fmt:kev:mtx:book&rft.genre=document&rft.atitle=On%20the%20Distribution,%20Sparsity,%20and%20Inference-time%20Quantization%20of%20Attention%20Values%20in%20Transformers&rft.jtitle=arXiv.org&rft.au=Ji,%20Tianchu&rft.date=2021-06-02&rft.eissn=2331-8422&rft_id=info:doi/&rft_dat=%3Cproquest%3E2536667393%3C/proquest%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2536667393&rft_id=info:pmid/&rfr_iscdi=true