On the Distribution, Sparsity, and Inference-time Quantization of Attention Values in Transformers

How much information do NLP tasks really need from a transformer's attention mechanism at application-time (inference)? From recent work, we know that there is sparsity in transformers and that the floating-points within its computation can be discretized to fewer values with minimal loss to task accuracies...

Full Description

Saved in:
Bibliographic Details
Published in: arXiv.org 2021-06
Main Authors: Ji, Tianchu; Jain, Shraddhan; Ferdman, Michael; Milder, Peter; Schwartz, H Andrew; Balasubramanian, Niranjan
Format: Article
Language: eng
Subjects: Accuracy; Data mining; Inference; Measurement; Pruning; Questions; Sentiment analysis; Sparsity; Transformers
Online Access: Full text
container_end_page
container_issue
container_start_page
container_title arXiv.org
container_volume
creator Ji, Tianchu
Jain, Shraddhan
Ferdman, Michael
Milder, Peter
Schwartz, H Andrew
Balasubramanian, Niranjan
description How much information do NLP tasks really need from a transformer's attention mechanism at application-time (inference)? From recent work, we know that there is sparsity in transformers and that the floating-points within its computation can be discretized to fewer values with minimal loss to task accuracies. However, this requires retraining or even creating entirely new models, both of which can be expensive and carbon-emitting. Focused on optimizations that do not require training, we systematically study the full range of typical attention values necessary. This informs the design of an inference-time quantization technique using both pruning and log-scaled mapping which produces only a few (e.g. \(2^3\)) unique values. Over the tasks of question answering and sentiment analysis, we find nearly 80% of attention values can be pruned to zeros with minimal (\(< 1.0\%\)) relative loss in accuracy. We use this pruning technique in conjunction with quantizing the attention values to only a 3-bit format, without retraining, resulting in only a 0.8% accuracy reduction on question answering with fine-tuned RoBERTa.
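The abstract describes a two-step inference-time scheme: prune small post-softmax attention probabilities to zero, then map the survivors onto only a few (e.g. \(2^3\)) log-scaled values. The snippet below is a minimal NumPy sketch of that idea, not the authors' implementation; the pruning threshold, the placement of the log-spaced levels, and the function name are illustrative assumptions.

```python
import numpy as np

def prune_and_log_quantize(attn, prune_threshold=1e-3, n_bits=3):
    """Sketch of inference-time attention compression:
    zero out small attention probabilities, then snap the
    survivors onto a handful of log-spaced levels.

    `attn` holds post-softmax attention probabilities in [0, 1].
    The threshold and the level placement are illustrative choices,
    not the exact scheme from the paper.
    """
    attn = np.asarray(attn, dtype=np.float64)

    # 1) Pruning: values below the threshold are treated as zero.
    kept = np.where(attn >= prune_threshold, attn, 0.0)

    # 2) Log-scaled quantization of the surviving values.
    #    With n_bits = 3 there are 2**3 = 8 representable codes;
    #    one code is reserved for zero, so 7 nonzero levels are
    #    spaced evenly in log space between the threshold and 1.0.
    n_levels = 2 ** n_bits - 1
    levels = np.logspace(np.log10(prune_threshold), 0.0, n_levels)

    out = np.zeros_like(kept)
    nz = kept > 0
    if np.any(nz):
        vals = kept[nz]  # surviving values, flattened
        # Snap each surviving value to the nearest log-spaced level.
        idx = np.argmin(np.abs(np.log10(vals)[:, None] - np.log10(levels)[None, :]), axis=1)
        out[nz] = levels[idx]
    return out

# Example: one attention row over 8 tokens (values roughly sum to 1).
row = np.array([0.62, 0.20, 0.10, 0.05, 0.02, 0.007, 0.002, 0.0008])
q = prune_and_log_quantize(row)
print(q)
print("fraction pruned to zero:", np.mean(q == 0.0))
```

Applied per attention row at inference time, a scheme of this shape needs no retraining; the abstract reports that nearly 80% of attention values can be pruned to zero with under 1% relative accuracy loss, and that combining pruning with a 3-bit quantization costs only about 0.8% accuracy on question answering with fine-tuned RoBERTa.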
format Article
fulltext fulltext
identifier EISSN: 2331-8422
ispartof arXiv.org, 2021-06
issn 2331-8422
language eng
recordid cdi_proquest_journals_2536667393
source Free E-Journals
subjects Accuracy
Data mining
Inference
Measurement
Pruning
Questions
Sentiment analysis
Sparsity
Transformers
title On the Distribution, Sparsity, and Inference-time Quantization of Attention Values in Transformers
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-02-03T16%3A42%3A01IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest&rft_val_fmt=info:ofi/fmt:kev:mtx:book&rft.genre=document&rft.atitle=On%20the%20Distribution,%20Sparsity,%20and%20Inference-time%20Quantization%20of%20Attention%20Values%20in%20Transformers&rft.jtitle=arXiv.org&rft.au=Ji,%20Tianchu&rft.date=2021-06-02&rft.eissn=2331-8422&rft_id=info:doi/&rft_dat=%3Cproquest%3E2536667393%3C/proquest%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2536667393&rft_id=info:pmid/&rfr_iscdi=true