Accelerating Transformers with Spectrum-Preserving Token Merging

Increasing the throughput of the Transformer architecture, a foundational component used in numerous state-of-the-art models for vision and language tasks (e.g., GPT, LLaVa), is an important problem in machine learning. One recent and effective strategy is to merge token representations within Transformer models, aiming to reduce computational and memory requirements while maintaining accuracy. Prior works have proposed algorithms based on Bipartite Soft Matching (BSM), which divides tokens into distinct sets and merges the top-k most similar tokens. However, these methods have significant drawbacks, such as sensitivity to the token-splitting strategy and damage to informative tokens in later layers. This paper presents a novel paradigm called PiToMe, which prioritizes the preservation of informative tokens using an additional metric termed the energy score. This score identifies large clusters of similar tokens as high-energy, indicating potential candidates for merging, while smaller (unique and isolated) clusters are considered low-energy and are preserved. Experimental findings demonstrate that PiToMe saves 40-60% of the FLOPs of the base models while exhibiting superior off-the-shelf performance on image classification (a 0.5% average performance drop for ViT-MAE-H, compared to 2.6% for baselines), image-text retrieval (a 0.3% average performance drop for CLIP on Flickr30k, compared to 4.5% for other methods), and, analogously, in visual question answering with LLaVa-7B. Furthermore, PiToMe is theoretically shown to preserve the intrinsic spectral properties of the original token space under mild conditions.
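
The record itself contains no code; as a rough illustration of the merging idea the abstract describes, the sketch below assigns each token an energy score (high when the token sits in a large cluster of similar tokens, low when it is unique and isolated) and merges only the highest-energy tokens into their nearest kept neighbour. The cosine similarity, the margin threshold, the exact score formula, and the averaging merge are illustrative assumptions, not the authors' PiToMe algorithm; see the paper at the DOI listed below for the actual method.

    import numpy as np

    def energy_scores(tokens: np.ndarray, margin: float = 0.5) -> np.ndarray:
        """Toy energy score: mean similarity to sufficiently similar neighbours.

        Tokens inside a large cluster of near-duplicates score high (merge
        candidates); unique, isolated tokens score low (preserved). The margin
        and the averaging are illustrative choices, not PiToMe's formulation.
        """
        x = tokens / np.linalg.norm(tokens, axis=1, keepdims=True)
        sim = x @ x.T                       # pairwise cosine similarity
        np.fill_diagonal(sim, 0.0)          # ignore self-similarity
        return np.where(sim > margin, sim, 0.0).mean(axis=1)

    def merge_high_energy_tokens(tokens: np.ndarray, k: int) -> np.ndarray:
        """Merge the k highest-energy tokens into their nearest kept neighbour
        by averaging; assumes k is smaller than the number of tokens."""
        scores = energy_scores(tokens)
        merge_ids = set(np.argsort(-scores)[:k])             # k merge candidates
        keep_ids = [i for i in range(len(tokens)) if i not in merge_ids]

        x = tokens / np.linalg.norm(tokens, axis=1, keepdims=True)
        groups = {i: [tokens[i]] for i in keep_ids}
        for i in merge_ids:
            # nearest kept token under cosine similarity
            j = keep_ids[int(np.argmax(x[i] @ x[keep_ids].T))]
            groups[j].append(tokens[i])
        return np.stack([np.mean(groups[i], axis=0) for i in keep_ids])

    # Example: reduce 200 random 64-d "tokens" by 50 merges.
    tokens = np.random.randn(200, 64).astype(np.float32)
    reduced = merge_high_energy_tokens(tokens, k=50)
    print(tokens.shape, "->", reduced.shape)                 # (200, 64) -> (150, 64)

In the paper the reduction is applied inside the Transformer layers, which is where the reported FLOP savings arise; this standalone example only demonstrates how an energy score can prioritise which tokens get merged and which are preserved.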

Bibliographic Details

Authors: Tran, Hoai-Chau; Nguyen, Duy M. H; Nguyen, Duy M; Nguyen, Trung-Tin; Le, Ngan; Xie, Pengtao; Sonntag, Daniel; Zou, James Y; Nguyen, Binh T; Niepert, Mathias
Format: Article
Language: English
Subjects: Computer Science - Learning
DOI: 10.48550/arxiv.2405.16148
Source: arXiv.org
Date: 2024-05-25
License: CC BY 4.0
Online access: https://arxiv.org/abs/2405.16148