MoIL: Momentum Imitation Learning for Efficient Vision-Language Adaptation

Bibliographic details
Published in: IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024-07, Vol. PP, p. 1-13
Main authors: Luo, Gen; Zhou, Yiyi; Huang, Minglang; Ren, Tianhe; Sun, Xiaoshuai; Ji, Rongrong
Format: Article
Language: English
Description
Summary: Pre-training and fine-tuning have been the de facto paradigm in vision-language domains. With the rapid growth of model sizes, fully fine-tuning large-scale vision-language pre-training (VLP) models incurs prohibitively expensive storage costs. To address this issue, recent advances in NLP offer a promising and efficient adaptation approach called LoRA, which aims to approximate the fine-tuning of a large pre-trained model by updating low-rank parameters. Despite its effectiveness, we identify that LoRA suffers from a large approximation error on VLP models and that its optimization is also inefficient, which greatly limits its performance upper bound. In this paper, we mathematically prove that the approximation error of low-rank adaptation can be optimized by a new optimization objective, i.e., the weight distance between LoRA and fine-tuning. Based on this finding, we propose a novel parameter-efficient transfer learning (PETL) method for VLP models, namely momentum imitation learning (MoIL). Specifically, MoIL formulates PETL as a weight imitation learning process and directly optimizes the approximation error bound of the low-rank adaptation. Based on this training scheme, we also explore a new hybrid approximation function to reduce the learning difficulty of low-rank adaptations. With these two novel designs, MoIL can greatly improve the optimization efficiency of the low-rank parameters on VLP models. We validate MoIL on three VLP models, ranging from an end-to-end network to a two-stage network, and conduct extensive experiments on four VL tasks. Experimental results demonstrate that MoIL achieves superior performance and optimization efficiency compared with existing PETL methods. For instance, by updating only 6.23% of the parameters, MoIL can even outperform full tuning by +2.3% on the image-text matching task. Meanwhile, its inference efficiency and generalization ability are also validated on multiple VLP models, e.g., VLMO and VinVL.
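To make the notion of a "weight distance between LoRA and fine-tuning" concrete, the following is a minimal sketch and not the paper's implementation. It assumes the distance is measured with the Frobenius norm, which the summary does not specify, and it uses toy 3x3 matrices. All names (w0, w_ft, b, a, lora_weight, weight_distance) are hypothetical and chosen only for illustration.

```python
import math


def matmul(x, y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*y)] for row in x]


def add(x, y):
    """Element-wise sum of two equally shaped matrices."""
    return [[p + q for p, q in zip(rx, ry)] for rx, ry in zip(x, y)]


def lora_weight(w0, b, a):
    """Effective weight W0 + B @ A: W0 stays frozen, only B and A are trained."""
    return add(w0, matmul(b, a))


def weight_distance(x, y):
    """Frobenius norm of x - y (an assumed choice of distance)."""
    return math.sqrt(sum((p - q) ** 2
                         for rx, ry in zip(x, y)
                         for p, q in zip(rx, ry)))


# Toy values, invented for illustration only.
w0 = [[1.0, 0.0, 0.0],
      [0.0, 1.0, 0.0],
      [0.0, 0.0, 1.0]]          # frozen pre-trained weight
w_ft = [[2.0, 1.0, 0.0],
        [1.0, 2.0, 0.0],
        [0.0, 0.0, 1.5]]        # hypothetical fully fine-tuned weight
b = [[1.0], [1.0], [0.0]]       # 3x1 low-rank factor (rank 1)
a = [[1.0, 1.0, 0.0]]           # 1x3 low-rank factor (rank 1)

full_params = len(w0) * len(w0[0])               # 9 values if W is tuned directly
lora_params = len(b) * len(b[0]) + len(a) * len(a[0])  # 6 values for B and A

dist_before = weight_distance(w0, w_ft)
dist_after = weight_distance(lora_weight(w0, b, a), w_ft)
print(full_params, lora_params, dist_before, dist_after)
```

In this toy case the rank-1 update brings the effective weight closer to the fine-tuned one while training fewer values than full tuning. The residual distance that remains is the kind of approximation error the summary refers to; how MoIL actually minimizes it (including the momentum and hybrid approximation components) is not described here and is not reproduced by this sketch.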
ISSN: 0162-8828, 1939-3539, 2160-9292
DOI: 10.1109/TPAMI.2024.3435790