MVP: Meta Visual Prompt Tuning for Few-Shot Remote Sensing Image Scene Classification

Bibliographic details
Published in:	IEEE Transactions on Geoscience and Remote Sensing, 2024, Vol. 62, pp. 1-13
Authors:	Zhu, Junjie; Li, Yiying; Yang, Ke; Guan, Naiyang; Fan, Zunlin; Qiu, Chunping; Yi, Xiaodong
Format:	Article
Language:	English
Keywords:	Adaptation models; Classification; Data augmentation; Data models; Embedding; Few-shot learning; Image classification; Information processing; Mathematical models; meta-learning; Metalearning; parameter-efficient fine-tuning; Parameters; prompt tuning; Recombination; Remote sensing; Task analysis; Transformers; Tuning; Visual tasks

Description
Abstract:	Vision transformer (ViT) models have recently emerged as powerful and versatile tools for various visual tasks. In this article, we investigate ViT in a more challenging setting: few-shot conditions. Recent work has achieved promising results in few-shot image classification using pretrained ViT models. However, that work fully fine-tunes the model for each downstream task, which leads to significant overfitting and storage costs, especially in the remote sensing domain. To tackle these issues, we turn to recently proposed parameter-efficient tuning (PETuning) methods, which update only newly added parameters while keeping the pretrained backbone frozen. Inspired by these methods, we propose the meta visual prompt tuning (MVP) method. Specifically, we integrate a prompt-tuning-based PETuning method into a meta-learning framework and tailor it to remote sensing datasets, resulting in an efficient framework for few-shot remote sensing scene classification (FS-RSSC). Moreover, we introduce a novel data augmentation scheme that recombines patch embeddings to increase the diversity and quantity of training data. This scheme generalizes to any network that uses the ViT architecture as its backbone. Experimental results on the FS-RSSC benchmark show that MVP outperforms existing methods in various settings, including various-way-various-shot, various-way-one-shot, and cross-domain adaptation.
ISSN:	0196-2892; 1558-0644
DOI:	10.1109/TGRS.2024.3359599
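The abstract names two mechanisms without giving their details: learnable prompt tokens fed to a frozen ViT, and an augmentation that recombines patch embeddings. The following is a minimal, hypothetical sketch of one way each could look at the token level. It is not the authors' implementation. Assumptions include that prompts are inserted between the [CLS] token and the patch tokens (as in common visual prompt tuning setups), and that recombination swaps a random subset of patch positions between two same-class images. The function names `recombine_patches` and `prepend_prompts` and the parameter `swap_ratio` are all hypothetical. Embeddings are plain Python lists so the example needs only the standard library.

```python
import random
from typing import List, Sequence

Token = List[float]  # one embedding vector; toy dimension for illustration


def recombine_patches(
    tokens_a: Sequence[Token],
    tokens_b: Sequence[Token],
    swap_ratio: float,
    rng: random.Random,
) -> List[Token]:
    """Hypothetical patch-embedding recombination (assumed scheme).

    Position i keeps tokens_a[i] unless it is selected, in which case it
    takes tokens_b[i]. Both inputs are assumed to be patch embeddings
    (without [CLS]) of two images from the same class.
    """
    if len(tokens_a) != len(tokens_b):
        raise ValueError("both images must yield the same number of patches")
    if not 0.0 <= swap_ratio <= 1.0:
        raise ValueError("swap_ratio must lie in [0, 1]")
    n_swap = round(swap_ratio * len(tokens_a))
    # Choose which patch positions are taken from the second image.
    swap_idx = set(rng.sample(range(len(tokens_a)), n_swap))
    return [
        list(tokens_b[i]) if i in swap_idx else list(tokens_a[i])
        for i in range(len(tokens_a))
    ]


def prepend_prompts(
    cls_token: Token,
    prompt_tokens: Sequence[Token],
    patch_tokens: Sequence[Token],
) -> List[Token]:
    """Assemble an encoder input sequence as [CLS, prompts..., patches...].

    In a prompt-tuning setup only prompt_tokens would be trainable; the
    backbone that consumes this sequence stays frozen (assumed layout).
    """
    return (
        [list(cls_token)]
        + [list(p) for p in prompt_tokens]
        + [list(t) for t in patch_tokens]
    )


if __name__ == "__main__":
    rng = random.Random(0)
    dim, n_patches, n_prompts = 4, 9, 2
    img_a = [[1.0] * dim for _ in range(n_patches)]
    img_b = [[2.0] * dim for _ in range(n_patches)]
    mixed = recombine_patches(img_a, img_b, swap_ratio=1 / 3, rng=rng)
    print(sum(tok[0] == 2.0 for tok in mixed))  # 3 patches taken from img_b
    prompts = [[0.0] * dim for _ in range(n_prompts)]
    seq = prepend_prompts([9.0] * dim, prompts, mixed)
    print(len(seq))  # 1 + 2 + 9 = 12
```

Swapping whole positions keeps each token aligned with its original spatial location, so positional information stays consistent. This is one plausible design choice among several, not necessarily the one used in MVP.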