Towards Lightweight Transformer Via Group-Wise Transformation for Vision-and-Language Tasks

Despite its exciting performance, the Transformer is criticized for its excessive parameters and computation cost. However, compressing the Transformer remains an open problem due to the internal complexity of its layer designs, i.e., Multi-Head Attention (MHA) and Feed-Forward Network (FFN). To address...

Detailed Description

Saved in:
Bibliographic Details
Published in: IEEE transactions on image processing 2022, Vol.31, p.3386-3398
Main Authors: Luo, Gen, Zhou, Yiyi, Sun, Xiaoshuai, Wang, Yan, Cao, Liujuan, Wu, Yongjian, Huang, Feiyue, Ji, Rongrong
Format: Article
Language: eng
Subjects:
Online Access: Order full text
container_end_page 3398
container_issue
container_start_page 3386
container_title IEEE transactions on image processing
container_volume 31
creator Luo, Gen
Zhou, Yiyi
Sun, Xiaoshuai
Wang, Yan
Cao, Liujuan
Wu, Yongjian
Huang, Feiyue
Ji, Rongrong
description Despite its exciting performance, the Transformer is criticized for its excessive parameters and computation cost. However, compressing the Transformer remains an open problem due to the internal complexity of its layer designs, i.e., Multi-Head Attention (MHA) and Feed-Forward Network (FFN). To address this issue, we introduce Group-wise Transformation towards a universal yet lightweight Transformer for vision-and-language tasks, termed LW-Transformer. LW-Transformer applies Group-wise Transformation to reduce both the parameters and computations of the Transformer, while preserving its two main properties, i.e., the efficient attention modeling on diverse subspaces of MHA, and the expanding-scaling feature transformation of FFN. We apply LW-Transformer to a set of Transformer-based networks, and quantitatively measure them on three vision-and-language tasks and six benchmark datasets. Experimental results show that, while saving a large number of parameters and computations, LW-Transformer achieves highly competitive performance against the original Transformer networks on vision-and-language tasks. To examine its generalization ability, we apply LW-Transformer to the task of image classification, building its network on a recently proposed image Transformer called Swin-Transformer, where its effectiveness is also confirmed.
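The core idea the abstract describes — replacing one large dense transformation with several smaller per-group transformations to cut parameters and computation — can be sketched as follows. This is a minimal NumPy illustration of the general group-wise principle only, not the paper's actual LW-Transformer implementation; all names, shapes, and sizes here are illustrative assumptions.

```python
import numpy as np

def groupwise_transform(x, group_weights):
    """Split the feature dimension into groups and apply a small
    per-group weight matrix, instead of one dense d x d matrix."""
    g = len(group_weights)
    chunks = np.split(x, g, axis=-1)  # g chunks of size d/g each
    return np.concatenate(
        [chunk @ w for chunk, w in zip(chunks, group_weights)], axis=-1
    )

d, g = 512, 4  # illustrative feature dimension and group count
rng = np.random.default_rng(0)
x = rng.standard_normal((2, d))
group_weights = [rng.standard_normal((d // g, d // g)) for _ in range(g)]

y = groupwise_transform(x, group_weights)
print(y.shape)  # (2, 512) -- output dimension is preserved

# A dense d x d layer needs d^2 parameters; g group matrices of
# size (d/g) x (d/g) need only d^2 / g in total.
dense_params = d * d
group_params = g * (d // g) ** 2
print(dense_params // group_params)  # 4 -- a g-fold parameter reduction
```

The g-fold saving shown here is the reason group-wise designs shrink both MHA projections and FFN layers while keeping the per-subspace structure intact.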
doi_str_mv 10.1109/TIP.2021.3139234
format Article
fulltext fulltext_linktorsrc
identifier ISSN: 1057-7149
ispartof IEEE transactions on image processing, 2022, Vol.31, p.3386-3398
issn 1057-7149
1941-0042
language eng
recordid cdi_pubmed_primary_35471883
source IEEE Electronic Library (IEL)
subjects Benchmark testing
Computational modeling
Convolution
Head
image captioning
Image classification
Lightweight
Lightweight transformer
Parameters
referring expression comprehension
Subspaces
Task analysis
Transformations
Transformers
visual question answering
Visualization
Writing instruction
title Towards Lightweight Transformer Via Group-Wise Transformation for Vision-and-Language Tasks
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2024-12-20T19%3A01%3A35IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_RIE&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Towards%20Lightweight%20Transformer%20Via%20Group-Wise%20Transformation%20for%20Vision-and-Language%20Tasks&rft.jtitle=IEEE%20transactions%20on%20image%20processing&rft.au=Luo,%20Gen&rft.date=2022&rft.volume=31&rft.spage=3386&rft.epage=3398&rft.pages=3386-3398&rft.issn=1057-7149&rft.eissn=1941-0042&rft.coden=IIPRE4&rft_id=info:doi/10.1109/TIP.2021.3139234&rft_dat=%3Cproquest_RIE%3E2656199375%3C/proquest_RIE%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2662096744&rft_id=info:pmid/35471883&rft_ieee_id=9763437&rfr_iscdi=true