Efficient Transfer Learning for Video-language Foundation Models
Main Authors: , , , , , , ,
Format: Article
Language: eng
Subjects:
Online Access: Order full text
Abstract: Pre-trained vision-language models provide a robust foundation for efficient
transfer learning across various downstream tasks. In the field of video action
recognition, mainstream approaches often introduce additional parameter modules
to capture temporal information. While the increased model capacity brought by
these additional parameters helps better fit the video-specific inductive
biases, existing methods require learning a large number of parameters and are
prone to catastrophic forgetting of the original generalizable knowledge. In
this paper, we propose a simple yet effective Multi-modal Spatio-Temporal
Adapter (MSTA) to improve the alignment between representations in the text and
vision branches, achieving a balance between general knowledge and
task-specific knowledge. Furthermore, to mitigate over-fitting and enhance
generalizability, we introduce a spatio-temporal description-guided consistency
constraint. This constraint involves feeding template inputs (i.e., ``a video
of $\{\textbf{cls}\}$'') into the trainable language branch, while
LLM-generated spatio-temporal descriptions are input into the pre-trained
language branch, enforcing consistency between the outputs of the two branches.
This mechanism prevents over-fitting to downstream tasks and improves the
distinguishability of the trainable branch within the spatio-temporal semantic
space. We evaluate the effectiveness of our approach across four tasks:
zero-shot transfer, few-shot learning, base-to-novel generalization, and
fully-supervised learning. Compared to many state-of-the-art methods, our MSTA
achieves outstanding performance across all evaluations, while using only 2-7\%
of the trainable parameters in the original model. Code will be available at
https://github.com/chenhaoxing/ETL4Video.
DOI: 10.48550/arxiv.2411.11223
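
The abstract describes two ingredients: lightweight adapters trained inside otherwise frozen text and vision branches, and a consistency constraint that pulls the trainable text branch's embedding of a template prompt toward the frozen branch's embedding of an LLM-generated spatio-temporal description. The following PyTorch fragment is a minimal sketch of that idea only; the `SpatioTemporalAdapter` class, the bottleneck width, and the L1 consistency loss are illustrative assumptions, not the authors' implementation (see the linked repository for the actual code).

```python
# Minimal sketch of the description-guided consistency idea from the abstract.
# Module names, shapes, and the loss form are assumptions for illustration only.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SpatioTemporalAdapter(nn.Module):
    """Hypothetical bottleneck adapter inserted into a frozen transformer block."""

    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Residual bottleneck: only `down`/`up` are trained; the backbone stays frozen.
        return x + self.up(F.gelu(self.down(x)))


def consistency_loss(trainable_text_feat: torch.Tensor,
                     frozen_desc_feat: torch.Tensor) -> torch.Tensor:
    """Pull the trainable branch's embedding of the template prompt
    ("a video of {cls}") toward the frozen branch's embedding of an
    LLM-generated description. L1 on normalized features is used here
    purely as an example of a consistency term."""
    a = F.normalize(trainable_text_feat, dim=-1)
    b = F.normalize(frozen_desc_feat, dim=-1).detach()  # frozen branch supplies targets only
    return (a - b).abs().mean()


if __name__ == "__main__":
    # Toy tensors standing in for text-encoder outputs (4 classes, feature dim 512).
    template_emb = torch.randn(4, 512, requires_grad=True)  # trainable branch output
    description_emb = torch.randn(4, 512)                   # frozen branch output
    loss = consistency_loss(template_emb, description_emb)
    loss.backward()
    print(f"consistency loss: {loss.item():.4f}")
```

In a full training loop, this consistency term would be added to the task loss so that the adapted text branch stays close to the generalizable, description-conditioned outputs of the frozen model, which is how the abstract motivates reduced over-fitting.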