M$^2$IST: Multi-Modal Interactive Side-Tuning for Efficient Referring Expression Comprehension
Saved in:
Main authors: , , , , , , ,
Format: Article
Language: eng
Subjects:
Online access: Order full text
Summary: Referring expression comprehension (REC) is a vision-language task to locate
a target object in an image based on a language expression. Fully fine-tuning
general-purpose pre-trained vision-language foundation models for REC yields
impressive performance but becomes increasingly costly. Parameter-efficient
transfer learning (PETL) methods have shown strong performance with fewer
tunable parameters. However, directly applying PETL to REC faces two
challenges: (1) insufficient multi-modal interaction between pre-trained
vision-language foundation models, and (2) high GPU memory usage due to
gradients passing through the heavy vision-language foundation models. To this
end, we present M$^2$IST: Multi-Modal Interactive Side-Tuning with M$^3$ISAs:
Mixture of Multi-Modal Interactive Side-Adapters. During fine-tuning, we keep
the pre-trained uni-modal encoders fixed, updating M$^3$ISAs on side networks
to progressively connect them, enabling more comprehensive vision-language
alignment and efficient tuning for REC. Empirical results reveal that M$^2$IST
achieves an optimal balance between performance and efficiency compared to most
full fine-tuning and other PETL methods. With M$^2$IST, standard
transformer-based REC methods present competitive or even superior performance
compared to full fine-tuning, while utilizing only 2.11\% of the tunable
parameters, 39.61\% of the GPU memory, and 63.46\% of the fine-tuning time
required for full fine-tuning.
DOI: 10.48550/arxiv.2407.01131
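
The abstract describes the general recipe (frozen uni-modal encoders, a trainable side network of multi-modal adapters, gradients confined to the side path) but not the concrete M$^3$ISA design. The PyTorch snippet below is only a minimal sketch of that side-tuning idea under assumed interfaces: the class names (`SideAdapter`, `SideTunedRECModel`, `DummyEncoder`), the cross-attention fusion, the bottleneck sizes, and the box-regression head are illustrative choices, not the paper's implementation.

```python
import torch
import torch.nn as nn


class SideAdapter(nn.Module):
    """Illustrative adapter: down-project vision/language tokens, fuse them
    with cross-attention, and add the result to the side network's state."""

    def __init__(self, vis_dim=768, lang_dim=768, bottleneck=64, heads=4):
        super().__init__()
        self.vis_down = nn.Linear(vis_dim, bottleneck)
        self.lang_down = nn.Linear(lang_dim, bottleneck)
        self.cross_attn = nn.MultiheadAttention(bottleneck, heads, batch_first=True)
        self.up = nn.Linear(bottleneck, vis_dim)

    def forward(self, vis_tokens, lang_tokens, side_state):
        q = self.vis_down(vis_tokens)          # queries from frozen vision tokens
        kv = self.lang_down(lang_tokens)       # keys/values from frozen language tokens
        fused, _ = self.cross_attn(q, kv, kv)  # vision attends to language
        return side_state + self.up(fused)     # residual update of the side state


class SideTunedRECModel(nn.Module):
    """Frozen uni-modal encoders plus a trainable side network of adapters;
    gradients never flow through the heavy backbones."""

    def __init__(self, vision_encoder, language_encoder, num_layers=12, dim=768):
        super().__init__()
        self.vision_encoder = vision_encoder
        self.language_encoder = language_encoder
        for enc in (self.vision_encoder, self.language_encoder):
            enc.eval()
            for p in enc.parameters():
                p.requires_grad = False
        self.adapters = nn.ModuleList(SideAdapter(dim, dim) for _ in range(num_layers))
        self.box_head = nn.Linear(dim, 4)  # (cx, cy, w, h) of the referred object

    def forward(self, image_tokens, text_tokens):
        with torch.no_grad():  # backbone features carry no gradient
            vis_layers = self.vision_encoder(image_tokens)    # list of per-layer vision tokens
            lang_layers = self.language_encoder(text_tokens)  # list of per-layer language tokens
        side_state = torch.zeros_like(vis_layers[0])
        for adapter, v, l in zip(self.adapters, vis_layers, lang_layers):
            side_state = adapter(v, l, side_state)
        return self.box_head(side_state.mean(dim=1))  # pool side tokens, regress the box


if __name__ == "__main__":
    class DummyEncoder(nn.Module):
        """Stand-in for a frozen pre-trained encoder exposing per-layer tokens."""

        def __init__(self, dim=768, layers=12):
            super().__init__()
            self.blocks = nn.ModuleList(nn.Linear(dim, dim) for _ in range(layers))

        def forward(self, x):
            feats = []
            for blk in self.blocks:
                x = blk(x)
                feats.append(x)
            return feats

    model = SideTunedRECModel(DummyEncoder(), DummyEncoder())
    boxes = model(torch.randn(2, 196, 768), torch.randn(2, 20, 768))  # ViT patches, text embeddings
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    print(boxes.shape, f"trainable {trainable}/{total} params")
```

Because the backbone features are computed under `torch.no_grad()`, only the adapter and head parameters receive gradients, which is what keeps tunable parameters and GPU memory low in this family of side-tuning methods.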