Pro-Tuning: Unified Prompt Tuning for Vision Tasks

In computer vision, fine-tuning is the de-facto approach to leverage pre-trained vision models to perform downstream tasks. However, deploying it in practice is quite challenging, due to adopting parameter inefficient global update and heavily relying on high-quality downstream data. Recently, promp...

Ausführliche Beschreibung

Gespeichert in:
Bibliographische Detailangaben
Veröffentlicht in:IEEE transactions on circuits and systems for video technology 2024-06, Vol.34 (6), p.4653-4667
Hauptverfasser: Nie, Xing, Ni, Bolin, Chang, Jianlong, Meng, Gaofeng, Huo, Chunlei, Xiang, Shiming, Tian, Qi
Format: Artikel
Sprache:eng
Schlagworte:
Online-Zugang:Volltext bestellen
Tags: Tag hinzufügen
Keine Tags, Fügen Sie den ersten Tag hinzu!
Beschreibung
Zusammenfassung:In computer vision, fine-tuning is the de-facto approach to leverage pre-trained vision models to perform downstream tasks. However, deploying it in practice is quite challenging, due to adopting parameter inefficient global update and heavily relying on high-quality downstream data. Recently, prompt-based learning, which adds the task-relevant prompt to adapt the pre-trained models to downstream tasks, has drastically boosted the performance of many natural language downstream tasks. In this work, we extend this notable transfer ability benefited from prompt into vision models as an alternative to fine-tuning. To this end, we propose parameter-efficient Prompt tuning (Pro-tuning) to adapt diverse frozen pre-trained models to a wide variety of downstream vision tasks. The key to Pro-tuning is prompt-based tuning, i.e., learning task-specific vision prompts for downstream input images with the pre-trained model frozen. By only training a small number of additional parameters, Pro-tuning can generate compact and robust downstream models both for CNN-based and transformer-based network architectures. Comprehensive experiments evidence that the proposed Pro-tuning outperforms fine-tuning on a broad range of vision tasks and scenarios, including image classification (under generic objects, class imbalance, image corruption, natural adversarial examples, and out-of-distribution generalization), and dense prediction tasks such as object detection and semantic segmentation.
ISSN:1051-8215
1558-2205
DOI:10.1109/TCSVT.2023.3327605