On the Distribution, Sparsity, and Inference-time Quantization of Attention Values in Transformers
How much information do NLP tasks really need from a transformer's attention mechanism at application-time (inference)? From recent work, we know that there is sparsity in transformers and that the floating-points within its computation can be discretized to fewer values with minimal loss to ta...
Published in: | arXiv.org 2021-06 |
---|---|
Main Authors: | Ji, Tianchu; Jain, Shraddhan; Ferdman, Michael; Milder, Peter; Schwartz, H Andrew; Balasubramanian, Niranjan |
Format: | Article |
Language: | eng |
Subjects: | Accuracy; Data mining; Inference; Measurement; Pruning; Questions; Sentiment analysis; Sparsity; Transformers |
Online Access: | Full text |
container_title | arXiv.org |
---|---|
creator | Ji, Tianchu; Jain, Shraddhan; Ferdman, Michael; Milder, Peter; Schwartz, H Andrew; Balasubramanian, Niranjan |
description | How much information do NLP tasks really need from a transformer's attention mechanism at application-time (inference)? From recent work, we know that there is sparsity in transformers and that the floating-points within its computation can be discretized to fewer values with minimal loss to task accuracies. However, this requires retraining or even creating entirely new models, both of which can be expensive and carbon-emitting. Focused on optimizations that do not require training, we systematically study the full range of typical attention values necessary. This informs the design of an inference-time quantization technique using both pruning and log-scaled mapping which produces only a few (e.g. \(2^3\)) unique values. Over the tasks of question answering and sentiment analysis, we find nearly 80% of attention values can be pruned to zeros with minimal (\(< 1.0\%\)) relative loss in accuracy. We use this pruning technique in conjunction with quantizing the attention values to only a 3-bit format, without retraining, resulting in only a 0.8% accuracy reduction on question answering with fine-tuned RoBERTa. |
format | Article |
fulltext | fulltext |
identifier | EISSN: 2331-8422 |
ispartof | arXiv.org, 2021-06 |
issn | 2331-8422 |
language | eng |
recordid | cdi_proquest_journals_2536667393 |
source | Free E-Journals |
subjects | Accuracy; Data mining; Inference; Measurement; Pruning; Questions; Sentiment analysis; Sparsity; Transformers |
title | On the Distribution, Sparsity, and Inference-time Quantization of Attention Values in Transformers |
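The abstract describes an inference-time scheme: prune most attention probabilities to zero, then map the surviving values onto a small log-scaled codebook (e.g. \(2^3 = 8\) unique values). The sketch below is a minimal, hypothetical illustration of that idea; the function name, the fraction-based pruning threshold, and the codebook construction are assumptions made here for illustration and are not the authors' released implementation.

```python
# Hypothetical sketch of inference-time attention pruning plus log-scaled
# quantization, as described in the paper's abstract (not the authors' code).
import numpy as np

def prune_and_log_quantize(attn, prune_frac=0.8, n_bits=3, eps=1e-8):
    """Zero out roughly the smallest `prune_frac` of attention values, then snap
    the survivors to the nearest of 2**n_bits log-spaced levels.

    `attn` holds softmax attention probabilities in [0, 1]. The per-input
    fraction-based threshold and the geomspace codebook are illustrative
    assumptions; the paper derives its scheme from the observed distribution
    of attention values.
    """
    attn = np.asarray(attn, dtype=np.float64)
    flat = np.sort(attn.ravel())
    # Threshold chosen so that ~prune_frac of the values fall below it.
    cutoff = flat[int(prune_frac * (flat.size - 1))]
    kept = np.where(attn >= cutoff, attn, 0.0)

    # Build a log-spaced codebook spanning the surviving (non-zero) values.
    nonzero = kept[kept > 0]
    if nonzero.size == 0:
        return kept
    levels = np.geomspace(nonzero.min() + eps, nonzero.max(), num=2 ** n_bits)

    # Snap each kept value to its nearest codebook level (compared in log space).
    idx = np.abs(np.log(kept[kept > 0])[:, None] - np.log(levels)[None, :]).argmin(axis=1)
    quantized = kept.copy()
    quantized[kept > 0] = levels[idx]
    return quantized

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    logits = rng.normal(size=(2, 16))                               # toy attention logits
    attn = np.exp(logits) / np.exp(logits).sum(-1, keepdims=True)   # row-wise softmax
    print(prune_and_log_quantize(attn))
```

In this toy setting, each output row contains only zeros plus at most eight distinct values, which is the property a 3-bit representation of attention would rely on.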