HELViT: highly efficient lightweight vision transformer for remote sensing image scene classification

Remote sensing image scene classification methods based on convolutional neural networks (CNN) have been extremely successful. However, the limitations of CNN itself make it difficult to acquire global information. The traditional Vision Transformer can effectively capture long-distance dependencies...

Detailed description

Bibliographic details
Published in:	Applied intelligence (Dordrecht, Netherlands), 2023-11, Vol.53 (21), p.24947-24962
Main authors:	Guo, Dongen; Wu, Zechen; Feng, Jiangfan; Zhou, Zhuoke; Shen, Zhen
Format:	Article
Language:	eng
Subjects:	Accuracy; Artificial Intelligence; Artificial neural networks; Computer Science; Datasets; Image classification; Lightweight; Machines; Manufacturing; Mechanical Engineering; Processes; Remote sensing
Online access:	Full text

container_end_page	24962
container_issue	21
container_start_page	24947
container_title	Applied intelligence (Dordrecht, Netherlands)
container_volume	53
creator	Guo, Dongen ; Wu, Zechen ; Feng, Jiangfan ; Zhou, Zhuoke ; Shen, Zhen
description	Remote sensing image scene classification methods based on convolutional neural networks (CNNs) have been extremely successful. However, the inherent limitations of CNNs make it difficult for them to acquire global information. The traditional Vision Transformer can effectively capture long-distance dependencies for acquiring global information, but it is computationally intensive. In addition, each scene class in remote sensing images contains a large number of similar background or foreground features. To leverage those similar features effectively and reduce computation, a highly efficient lightweight vision transformer (HELViT) is proposed. HELViT is a hybrid model combining CNN and Transformer, and consists of the Convolution and Attention Block (CAB) and the Convolution and Token Merging Block (CTMB). Specifically, in the CAB module, the embedding layer of the original Vision Transformer is replaced with a modified MBConv (MBConv*), and the Fast Multi-Head Self Attention (F-MHSA) is used to reduce the quadratic complexity of the self-attention mechanism to linear. To further decrease the model’s computational cost, CTMB employs adaptive token merging (ATOME) to fuse related foreground or background features. Experimental results on the UCM, AID and NWPU datasets show that the proposed model outperforms state-of-the-art remote sensing scene classification methods in both accuracy and efficiency. On the most challenging NWPU dataset, HELViT achieves the highest accuracy of 94.64%/96.84% with 4.6 GMACs for 10%/20% training samples, respectively.
doi_str_mv	10.1007/s10489-023-04725-y
format	Article
fullrecord	<record><control><sourceid>proquest_cross</sourceid><recordid>TN_cdi_proquest_journals_2880579708</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><sourcerecordid>2880579708</sourcerecordid><originalsourceid>FETCH-LOGICAL-c319t-cdec3fc7d5033cf4c098efdef4470735c5d9488259f9f10a24ad6f19c25f2fd43</originalsourceid><addsrcrecordid>eNp9UE1LAzEUDKJgrf4BTwHPqy8f22y8SalWKHip4i0s2Zdtyna3Jltl_72pK3jz8ob3mJk3DCHXDG4ZgLqLDGShM-AiA6l4ng0nZMJyJTIltTolE9BcZrOZfj8nFzFuAUAIYBOCy8Xqza_v6cbXm2ag6Jy3HtueNunQf-Fx0k8ffdfSPpRtdF3YYaAJaMBd1yON2Ebf1tTvyjptFluktilj9Mmr7JPykpy5sol49YtT8vq4WM-X2erl6Xn-sMqsYLrPbIVWOKuqPKWzTlrQBboKnZQKlMhtXmlZFDzXTjsGJZdlNXNMW5477ioppuRm9N2H7uOAsTfb7hDa9NLwooBcaQVFYvGRZUMXY0Bn9iFlD4NhYI51mrFOk-o0P3WaIYnEKIqJ3NYY_qz_UX0DStZ7Bw</addsrcrecordid><sourcetype>Aggregation Database</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>2880579708</pqid></control><display><type>article</type><title>HELViT: highly efficient lightweight vision transformer for remote sensing image scene classification</title><source>SpringerLink Journals - AutoHoldings</source><creator>Guo, Dongen ; Wu, Zechen ; Feng, Jiangfan ; Zhou, Zhuoke ; Shen, Zhen</creator><creatorcontrib>Guo, Dongen ; Wu, Zechen ; Feng, Jiangfan ; Zhou, Zhuoke ; Shen, Zhen</creatorcontrib><description>Remote sensing image scene classification methods based on convolutional neural networks (CNN) have been extremely successful. However, the limitations of CNN itself make it difficult to acquire global information. The traditional Vision Transformer can effectively capture long-distance dependencies for acquiring global information, but it is computationally intensive. In addition, each class of scene in remote sensing images has a large quantity of the similar background or foreground features. To effectively leverage those similar features and reduce the computation, a highly efficient lightweight vision transformer (HELViT) is proposed. 
HELViT is a hybrid model combining CNN and Transformer and consists of the Convolution and Attention Block (CAB), the Convolution and Token Merging Block (CTMB). Specifically, in CAB module, the embedding layer in the original Vision Transformer is replaced with a modified MBConv (MBConv ∗ ), and the Fast Multi-Head Self Attention (F-MHSA) is used to change the quadratic complexity of the self-attention mechanism to linear. To further decreasing the model’s computational cost, CTMB employs the adaptive token merging (ATOME) to fuse some related foreground or background features. The experimental results on the UCM, AID and NWPU datasets show that the proposed model displays better results in terms of accuracy and efficiency than the state-of-the-art remote sensing scene classification methods. On the most challenging NWPU dataset, HELViT achieves the highest accuracy of 94.64%/96.84% with 4.6G GMACs for 10%/20% training samples, respectively.</description><identifier>ISSN: 0924-669X</identifier><identifier>EISSN: 1573-7497</identifier><identifier>DOI: 10.1007/s10489-023-04725-y</identifier><language>eng</language><publisher>New York: Springer US</publisher><subject>Accuracy ; Artificial Intelligence ; Artificial neural networks ; Computer Science ; Datasets ; Image classification ; Lightweight ; Machines ; Manufacturing ; Mechanical Engineering ; Processes ; Remote sensing</subject><ispartof>Applied intelligence (Dordrecht, Netherlands), 2023-11, Vol.53 (21), p.24947-24962</ispartof><rights>The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature 2023. Springer Nature or its licensor (e.g. 
a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.</rights><lds50>peer_reviewed</lds50><woscitedreferencessubscribed>false</woscitedreferencessubscribed><citedby>FETCH-LOGICAL-c319t-cdec3fc7d5033cf4c098efdef4470735c5d9488259f9f10a24ad6f19c25f2fd43</citedby><cites>FETCH-LOGICAL-c319t-cdec3fc7d5033cf4c098efdef4470735c5d9488259f9f10a24ad6f19c25f2fd43</cites><orcidid>0000-0003-3927-7616</orcidid></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><linktopdf>$$Uhttps://link.springer.com/content/pdf/10.1007/s10489-023-04725-y$$EPDF$$P50$$Gspringer$$H</linktopdf><linktohtml>$$Uhttps://link.springer.com/10.1007/s10489-023-04725-y$$EHTML$$P50$$Gspringer$$H</linktohtml><link.rule.ids>314,780,784,27924,27925,41488,42557,51319</link.rule.ids></links><search><creatorcontrib>Guo, Dongen</creatorcontrib><creatorcontrib>Wu, Zechen</creatorcontrib><creatorcontrib>Feng, Jiangfan</creatorcontrib><creatorcontrib>Zhou, Zhuoke</creatorcontrib><creatorcontrib>Shen, Zhen</creatorcontrib><title>HELViT: highly efficient lightweight vision transformer for remote sensing image scene classification</title><title>Applied intelligence (Dordrecht, Netherlands)</title><addtitle>Appl Intell</addtitle><description>Remote sensing image scene classification methods based on convolutional neural networks (CNN) have been extremely successful. However, the limitations of CNN itself make it difficult to acquire global information. The traditional Vision Transformer can effectively capture long-distance dependencies for acquiring global information, but it is computationally intensive. 
In addition, each class of scene in remote sensing images has a large quantity of the similar background or foreground features. To effectively leverage those similar features and reduce the computation, a highly efficient lightweight vision transformer (HELViT) is proposed. HELViT is a hybrid model combining CNN and Transformer and consists of the Convolution and Attention Block (CAB), the Convolution and Token Merging Block (CTMB). Specifically, in CAB module, the embedding layer in the original Vision Transformer is replaced with a modified MBConv (MBConv ∗ ), and the Fast Multi-Head Self Attention (F-MHSA) is used to change the quadratic complexity of the self-attention mechanism to linear. To further decreasing the model’s computational cost, CTMB employs the adaptive token merging (ATOME) to fuse some related foreground or background features. The experimental results on the UCM, AID and NWPU datasets show that the proposed model displays better results in terms of accuracy and efficiency than the state-of-the-art remote sensing scene classification methods. 
On the most challenging NWPU dataset, HELViT achieves the highest accuracy of 94.64%/96.84% with 4.6G GMACs for 10%/20% training samples, respectively.</description><subject>Accuracy</subject><subject>Artificial Intelligence</subject><subject>Artificial neural networks</subject><subject>Computer Science</subject><subject>Datasets</subject><subject>Image classification</subject><subject>Lightweight</subject><subject>Machines</subject><subject>Manufacturing</subject><subject>Mechanical Engineering</subject><subject>Processes</subject><subject>Remote sensing</subject><issn>0924-669X</issn><issn>1573-7497</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2023</creationdate><recordtype>article</recordtype><sourceid>ABUWG</sourceid><sourceid>AFKRA</sourceid><sourceid>AZQEC</sourceid><sourceid>BENPR</sourceid><sourceid>CCPQU</sourceid><sourceid>DWQXO</sourceid><sourceid>GNUQQ</sourceid><recordid>eNp9UE1LAzEUDKJgrf4BTwHPqy8f22y8SalWKHip4i0s2Zdtyna3Jltl_72pK3jz8ob3mJk3DCHXDG4ZgLqLDGShM-AiA6l4ng0nZMJyJTIltTolE9BcZrOZfj8nFzFuAUAIYBOCy8Xqza_v6cbXm2ag6Jy3HtueNunQf-Fx0k8ffdfSPpRtdF3YYaAJaMBd1yON2Ebf1tTvyjptFluktilj9Mmr7JPykpy5sol49YtT8vq4WM-X2erl6Xn-sMqsYLrPbIVWOKuqPKWzTlrQBboKnZQKlMhtXmlZFDzXTjsGJZdlNXNMW5477ioppuRm9N2H7uOAsTfb7hDa9NLwooBcaQVFYvGRZUMXY0Bn9iFlD4NhYI51mrFOk-o0P3WaIYnEKIqJ3NYY_qz_UX0DStZ7Bw</recordid><startdate>20231101</startdate><enddate>20231101</enddate><creator>Guo, Dongen</creator><creator>Wu, Zechen</creator><creator>Feng, Jiangfan</creator><creator>Zhou, Zhuoke</creator><creator>Shen, Zhen</creator><general>Springer US</general><general>Springer Nature 
B.V</general><scope>AAYXX</scope><scope>CITATION</scope><scope>3V.</scope><scope>7SC</scope><scope>7WY</scope><scope>7WZ</scope><scope>7XB</scope><scope>87Z</scope><scope>8AL</scope><scope>8FD</scope><scope>8FE</scope><scope>8FG</scope><scope>8FK</scope><scope>8FL</scope><scope>ABJCF</scope><scope>ABUWG</scope><scope>AFKRA</scope><scope>ARAPS</scope><scope>AZQEC</scope><scope>BENPR</scope><scope>BEZIV</scope><scope>BGLVJ</scope><scope>CCPQU</scope><scope>DWQXO</scope><scope>FRNLG</scope><scope>F~G</scope><scope>GNUQQ</scope><scope>HCIFZ</scope><scope>JQ2</scope><scope>K60</scope><scope>K6~</scope><scope>K7-</scope><scope>L.-</scope><scope>L6V</scope><scope>L7M</scope><scope>L~C</scope><scope>L~D</scope><scope>M0C</scope><scope>M0N</scope><scope>M7S</scope><scope>P5Z</scope><scope>P62</scope><scope>PQBIZ</scope><scope>PQBZA</scope><scope>PQEST</scope><scope>PQQKQ</scope><scope>PQUKI</scope><scope>PRINS</scope><scope>PSYQQ</scope><scope>PTHSS</scope><scope>Q9U</scope><orcidid>https://orcid.org/0000-0003-3927-7616</orcidid></search><sort><creationdate>20231101</creationdate><title>HELViT: highly efficient lightweight vision transformer for remote sensing image scene classification</title><author>Guo, Dongen ; Wu, Zechen ; Feng, Jiangfan ; Zhou, Zhuoke ; Shen, Zhen</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-c319t-cdec3fc7d5033cf4c098efdef4470735c5d9488259f9f10a24ad6f19c25f2fd43</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2023</creationdate><topic>Accuracy</topic><topic>Artificial Intelligence</topic><topic>Artificial neural networks</topic><topic>Computer Science</topic><topic>Datasets</topic><topic>Image classification</topic><topic>Lightweight</topic><topic>Machines</topic><topic>Manufacturing</topic><topic>Mechanical Engineering</topic><topic>Processes</topic><topic>Remote 
sensing</topic><toplevel>peer_reviewed</toplevel><toplevel>online_resources</toplevel><creatorcontrib>Guo, Dongen</creatorcontrib><creatorcontrib>Wu, Zechen</creatorcontrib><creatorcontrib>Feng, Jiangfan</creatorcontrib><creatorcontrib>Zhou, Zhuoke</creatorcontrib><creatorcontrib>Shen, Zhen</creatorcontrib><collection>CrossRef</collection><collection>ProQuest Central (Corporate)</collection><collection>Computer and Information Systems Abstracts</collection><collection>Access via ABI/INFORM (ProQuest)</collection><collection>ABI/INFORM Global (PDF only)</collection><collection>ProQuest Central (purchase pre-March 2016)</collection><collection>ABI/INFORM Global (Alumni Edition)</collection><collection>Computing Database (Alumni Edition)</collection><collection>Technology Research Database</collection><collection>ProQuest SciTech Collection</collection><collection>ProQuest Technology Collection</collection><collection>ProQuest Central (Alumni) (purchase pre-March 2016)</collection><collection>ABI/INFORM Collection (Alumni Edition)</collection><collection>Materials Science & Engineering Collection</collection><collection>ProQuest Central (Alumni Edition)</collection><collection>ProQuest Central UK/Ireland</collection><collection>Advanced Technologies & Aerospace Collection</collection><collection>ProQuest Central Essentials</collection><collection>ProQuest Central</collection><collection>Business Premium Collection</collection><collection>Technology Collection</collection><collection>ProQuest One Community College</collection><collection>ProQuest Central Korea</collection><collection>Business Premium Collection (Alumni)</collection><collection>ABI/INFORM Global (Corporate)</collection><collection>ProQuest Central Student</collection><collection>SciTech Premium Collection</collection><collection>ProQuest Computer Science Collection</collection><collection>ProQuest Business Collection (Alumni Edition)</collection><collection>ProQuest Business 
Collection</collection><collection>Computer Science Database</collection><collection>ABI/INFORM Professional Advanced</collection><collection>ProQuest Engineering Collection</collection><collection>Advanced Technologies Database with Aerospace</collection><collection>Computer and Information Systems Abstracts Academic</collection><collection>Computer and Information Systems Abstracts Professional</collection><collection>ABI/INFORM Global</collection><collection>Computing Database</collection><collection>Engineering Database</collection><collection>Advanced Technologies & Aerospace Database</collection><collection>ProQuest Advanced Technologies & Aerospace Collection</collection><collection>ProQuest One Business</collection><collection>ProQuest One Business (Alumni)</collection><collection>ProQuest One Academic Eastern Edition (DO NOT USE)</collection><collection>ProQuest One Academic</collection><collection>ProQuest One Academic UKI Edition</collection><collection>ProQuest Central China</collection><collection>ProQuest One Psychology</collection><collection>Engineering Collection</collection><collection>ProQuest Central Basic</collection><jtitle>Applied intelligence (Dordrecht, Netherlands)</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Guo, Dongen</au><au>Wu, Zechen</au><au>Feng, Jiangfan</au><au>Zhou, Zhuoke</au><au>Shen, Zhen</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>HELViT: highly efficient lightweight vision transformer for remote sensing image scene classification</atitle><jtitle>Applied intelligence (Dordrecht, Netherlands)</jtitle><stitle>Appl Intell</stitle><date>2023-11-01</date><risdate>2023</risdate><volume>53</volume><issue>21</issue><spage>24947</spage><epage>24962</epage><pages>24947-24962</pages><issn>0924-669X</issn><eissn>1573-7497</eissn><abstract>Remote sensing image scene classification methods based on convolutional neural 
networks (CNN) have been extremely successful. However, the limitations of CNN itself make it difficult to acquire global information. The traditional Vision Transformer can effectively capture long-distance dependencies for acquiring global information, but it is computationally intensive. In addition, each class of scene in remote sensing images has a large quantity of the similar background or foreground features. To effectively leverage those similar features and reduce the computation, a highly efficient lightweight vision transformer (HELViT) is proposed. HELViT is a hybrid model combining CNN and Transformer and consists of the Convolution and Attention Block (CAB), the Convolution and Token Merging Block (CTMB). Specifically, in CAB module, the embedding layer in the original Vision Transformer is replaced with a modified MBConv (MBConv ∗ ), and the Fast Multi-Head Self Attention (F-MHSA) is used to change the quadratic complexity of the self-attention mechanism to linear. To further decreasing the model’s computational cost, CTMB employs the adaptive token merging (ATOME) to fuse some related foreground or background features. The experimental results on the UCM, AID and NWPU datasets show that the proposed model displays better results in terms of accuracy and efficiency than the state-of-the-art remote sensing scene classification methods. On the most challenging NWPU dataset, HELViT achieves the highest accuracy of 94.64%/96.84% with 4.6G GMACs for 10%/20% training samples, respectively.</abstract><cop>New York</cop><pub>Springer US</pub><doi>10.1007/s10489-023-04725-y</doi><tpages>16</tpages><orcidid>https://orcid.org/0000-0003-3927-7616</orcidid></addata></record>
fulltext	fulltext
identifier	ISSN: 0924-669X
ispartof	Applied intelligence (Dordrecht, Netherlands), 2023-11, Vol.53 (21), p.24947-24962
issn	0924-669X 1573-7497
language	eng
recordid	cdi_proquest_journals_2880579708
source	SpringerLink Journals - AutoHoldings
subjects	Accuracy ; Artificial Intelligence ; Artificial neural networks ; Computer Science ; Datasets ; Image classification ; Lightweight ; Machines ; Manufacturing ; Mechanical Engineering ; Processes ; Remote sensing
title	HELViT: highly efficient lightweight vision transformer for remote sensing image scene classification
url	https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2024-12-19T21%3A43%3A44IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_cross&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=HELViT:%20highly%20efficient%20lightweight%20vision%20transformer%20for%20remote%20sensing%20image%20scene%20classification&rft.jtitle=Applied%20intelligence%20(Dordrecht,%20Netherlands)&rft.au=Guo,%20Dongen&rft.date=2023-11-01&rft.volume=53&rft.issue=21&rft.spage=24947&rft.epage=24962&rft.pages=24947-24962&rft.issn=0924-669X&rft.eissn=1573-7497&rft_id=info:doi/10.1007/s10489-023-04725-y&rft_dat=%3Cproquest_cross%3E2880579708%3C/proquest_cross%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2880579708&rft_id=info:pmid/&rfr_iscdi=true