KNN Transformer with Pyramid Prompts for Few-Shot Learning
Few-Shot Learning (FSL) aims to recognize new classes with limited labeled data. Recent studies have attempted to address the challenge of rare samples with textual prompts to modulate visual features. However, they usually struggle to capture complex semantic relationships between textual and visual features.
Saved in:

| Published in: | arXiv.org, 2024-10 |
|---|---|
| Main authors: | Li, Wenhao; Wang, Qiangchang; Zhao, Peng; Yin, Yilong |
| Format: | Article |
| Language: | eng |
| Keywords: | Attention; Balances (scales); Computer Science - Computer Vision and Pattern Recognition; Context; Learning; Representations; Semantics; Transformers; Visual discrimination |
| Online access: | Full text |
| Field | Value |
|---|---|
| container_end_page | |
| container_issue | |
| container_start_page | |
| container_title | arXiv.org |
| container_volume | |
| creator | Li, Wenhao; Wang, Qiangchang; Zhao, Peng; Yin, Yilong |
| description | Few-Shot Learning (FSL) aims to recognize new classes with limited labeled data. Recent studies have attempted to address the challenge of rare samples with textual prompts to modulate visual features. However, they usually struggle to capture complex semantic relationships between textual and visual features. Moreover, vanilla self-attention is heavily affected by useless information in images, severely constraining the potential of semantic priors in FSL due to the confusion of numerous irrelevant tokens during interaction. To address these issues, a K-NN Transformer with Pyramid Prompts (KTPP) is proposed to select discriminative information with K-NN Context Attention (KCA) and adaptively modulate visual features with Pyramid Cross-modal Prompts (PCP). First, for each token, the KCA selects only the K most relevant tokens to compute the self-attention matrix and incorporates the mean of all tokens as a context prompt to provide global context in three cascaded stages. As a result, irrelevant tokens are progressively suppressed. Second, pyramid prompts are introduced in the PCP to emphasize visual features via interactions between text-based class-aware prompts and multi-scale visual features. This allows the ViT to dynamically adjust the importance weights of visual features based on rich semantic information at different scales, making models robust to spatial variations. Finally, augmented visual features and class-aware prompts interact via the KCA to extract class-specific features. Consequently, our model further enhances noise-free visual representations via deep cross-modal interactions, extracting generalized visual representations in scenarios with few labeled samples. Extensive experiments on four benchmark datasets demonstrate the effectiveness of our method. |
| doi_str_mv | 10.48550/arxiv.2410.10227 |
| format | Article |
| fulltext | fulltext |
| identifier | EISSN: 2331-8422 |
| ispartof | arXiv.org, 2024-10 |
| issn | 2331-8422 |
| language | eng |
| recordid | cdi_arxiv_primary_2410_10227 |
| source | Freely Accessible Journals; arXiv.org |
| subjects | Attention; Balances (scales); Computer Science - Computer Vision and Pattern Recognition; Context; Learning; Representations; Semantics; Transformers; Visual discrimination |
| title | KNN Transformer with Pyramid Prompts for Few-Shot Learning |
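The description above explains the K-NN Context Attention (KCA) only in words. Below is a minimal, hypothetical sketch of the general idea it describes, i.e., restricting each token's attention to its K most relevant tokens and appending the mean of all tokens as a global context prompt, written in plain PyTorch. The function name, the single-head form, and the omission of learned Q/K/V projections and of the three cascaded stages are simplifying assumptions for illustration, not the authors' implementation.

```python
import torch

def knn_context_attention(x: torch.Tensor, k: int) -> torch.Tensor:
    """Illustrative k-NN attention: each token attends only to its k highest-scoring
    tokens, with the mean of all tokens appended as a global context prompt.
    Single-head, no learned projections (an assumption, not the paper's exact KCA)."""
    # x: (batch, num_tokens, dim)
    context = x.mean(dim=1, keepdim=True)                       # (batch, 1, dim) context prompt
    kv = torch.cat([x, context], dim=1)                         # keys/values include the context token
    scores = x @ kv.transpose(-2, -1) / x.shape[-1] ** 0.5      # (batch, n, n + 1) similarity scores
    # keep only the k largest scores per query token; suppress the rest before softmax
    kth_best = scores.topk(k, dim=-1).values[..., -1:]          # k-th largest score per query
    masked = scores.masked_fill(scores < kth_best, float("-inf"))
    attn = masked.softmax(dim=-1)
    return attn @ kv                                            # (batch, num_tokens, dim)

# Example: 4 tokens of dimension 8, each attending to its 3 highest-scoring tokens
tokens = torch.randn(2, 4, 8)
out = knn_context_attention(tokens, k=3)
print(out.shape)  # torch.Size([2, 4, 8])
```

According to the abstract, the paper applies this kind of selection over three cascaded stages so that irrelevant tokens are progressively suppressed; the sketch above corresponds to a single stage.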