SaiT: Sparse Vision Transformers through Adaptive Token Pruning
Main authors: | Li, Ling; Thorsley, David; Hassoun, Joseph |
---|---|
Format: | Article |
Language: | eng |
Subjects: | Computer Science - Computer Vision and Pattern Recognition |
Online access: | Order full text |
creator | Li, Ling; Thorsley, David; Hassoun, Joseph |
description | While vision transformers have achieved impressive results, effectively and efficiently accelerating these models can further boost performance. In this work, we propose a dense/sparse training framework to obtain a unified model, enabling weight sharing across various token densities; one model thus offers a range of accuracy and throughput tradeoffs for different applications. Moreover, we introduce adaptive token pruning to optimize the patch token sparsity based on the input image, and we investigate knowledge distillation to enhance the token selection capability of early transformer modules. The Sparse adaptive image Transformer (SaiT) offers varying levels of model acceleration by merely changing the token sparsity on the fly. Specifically, SaiT reduces computational complexity (FLOPs) by 39%-43% and increases throughput by 67%-91% with less than 0.5% accuracy loss across various vision transformer models. The same model also provides a zero-accuracy-drop option by skipping the sparsification step. SaiT achieves better accuracy and computation tradeoffs than state-of-the-art transformer and convolutional models. |
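To make the pruning mechanism in the abstract concrete, here is a minimal, hypothetical sketch of adaptive token pruning for a ViT-style model. It assumes token importance is scored by the CLS token's attention to each patch and that the per-image keep budget covers a fixed share of the attention mass; the function name, the 0.9 mass threshold, and the minimum keep ratio are illustrative assumptions, not the authors' released implementation.

```python
# Hypothetical sketch of adaptive token pruning for a ViT-style model
# (illustrative only; not the SaiT authors' released code).
import torch


def prune_tokens(tokens: torch.Tensor,
                 cls_attn: torch.Tensor,
                 mass: float = 0.9,
                 min_keep_ratio: float = 0.5) -> torch.Tensor:
    """Keep the most important patch tokens for each image.

    tokens:   (B, 1 + N, D) -- CLS token followed by N patch tokens
    cls_attn: (B, N)        -- attention from the CLS token to each patch
    Returns (B, 1 + k, D), where k adapts to the batch's hardest image.
    """
    N = cls_attn.shape[1]
    # Normalize importance scores so they sum to 1 per image.
    scores = cls_attn / cls_attn.sum(dim=1, keepdim=True)
    # Adaptive budget: keep the fewest tokens covering `mass` of the
    # attention, but never fewer than min_keep_ratio * N (assumed heuristic).
    sorted_scores, order = scores.sort(dim=1, descending=True)
    covered = sorted_scores.cumsum(dim=1)
    keep_counts = (covered < mass).sum(dim=1) + 1
    keep_counts = keep_counts.clamp(min=int(min_keep_ratio * N))
    # Pad every image to the batch maximum so tensors stay dense.
    k = int(keep_counts.max())
    keep_idx = order[:, :k]                              # (B, k)
    patch_tokens = tokens[:, 1:, :]                      # (B, N, D)
    gathered = torch.gather(
        patch_tokens, 1,
        keep_idx.unsqueeze(-1).expand(-1, -1, tokens.size(-1)))
    # Re-attach the CLS token so downstream blocks see (B, 1 + k, D).
    return torch.cat([tokens[:, :1, :], gathered], dim=1)


# Example: batch of 2 images, 14x14 = 196 patch tokens, embedding dim 192.
x = torch.randn(2, 197, 192)
attn = torch.rand(2, 196).softmax(dim=1)
print(prune_tokens(x, attn).shape)  # torch.Size([2, 1 + k, 192])
```

Under these assumptions, images whose attention concentrates on a few patches keep fewer tokens, mirroring the idea of optimizing patch token sparsity per input; the batch is padded to its largest per-image budget so downstream blocks still operate on dense tensors, and setting min_keep_ratio to 1.0 keeps all tokens, loosely corresponding to the dense, zero-accuracy-drop path that skips sparsification.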
doi_str_mv | 10.48550/arxiv.2210.05832 |
format | Article |
identifier | DOI: 10.48550/arxiv.2210.05832 |
language | eng |
source | arXiv.org |
subjects | Computer Science - Computer Vision and Pattern Recognition |
title | SaiT: Sparse Vision Transformers through Adaptive Token Pruning |
url | https://arxiv.org/abs/2210.05832 |