Multi-Stage Knowledge Integration of Vision-Language Models for Continual Learning
Saved in:
Main Authors: | , , , , |
---|---|
Format: | Article |
Language: | eng |
Subjects: | |
Online Access: | Order full text |
Summary: | Vision Language Models (VLMs), pre-trained on large-scale image-text
datasets, enable zero-shot predictions for unseen data but may underperform on
specific unseen tasks. Continual learning (CL) can help VLMs adapt effectively
to new data distributions without joint training, but it faces the challenges
of catastrophic forgetting and generalization forgetting. Although
distillation-based methods have achieved significant progress, they exhibit two
severe limitations. One is that the popularly adopted single-teacher paradigm
fails to impart comprehensive knowledge; the other is that existing methods
inadequately leverage the multimodal information in the original training
dataset and instead rely on additional data for distillation, which increases
computational and storage overhead. To mitigate both limitations, drawing on
Knowledge Integration Theory (KIT), we propose a Multi-Stage Knowledge
Integration network (MulKI) that emulates the human learning process in
distillation methods. MulKI achieves this through four stages: Eliciting Ideas,
Adding New Ideas, Distinguishing Ideas, and Making Connections. Across these
stages, we first leverage prototypes to align across modalities, eliciting
cross-modal knowledge; we then add new knowledge by constructing fine-grained
intra- and inter-modality relationships with prototypes. After that, knowledge
from two teacher models is adaptively distinguished and re-weighted. Finally,
we connect models within and across tasks, integrating preceding and new
knowledge. Our method demonstrates significant improvements in maintaining
zero-shot capabilities while supporting continual learning across diverse
downstream tasks, showcasing its potential for adapting VLMs to evolving data
distributions. |
---|---|
DOI: | 10.48550/arxiv.2411.06764 |
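
The record above only summarizes the method. As a minimal sketch of the third stage mentioned in the summary (adaptively distinguishing and re-weighting knowledge from two teacher models), the snippet below assumes a PyTorch setup with a frozen pre-trained VLM as the zero-shot teacher and the previous-task model as the second teacher; the entropy-based weighting and all function names are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch (not the paper's code): distill a student from two
# teachers -- a frozen pre-trained VLM (zero-shot knowledge) and the model
# from the previous task (task-specific knowledge) -- with per-sample weights
# derived from each teacher's prediction entropy (an assumed weighting rule).
import torch
import torch.nn.functional as F


def entropy_weights(logits_a: torch.Tensor, logits_b: torch.Tensor, tau: float = 2.0):
    """Give more weight to the teacher that is more confident (lower entropy)."""
    def entropy(logits):
        p = F.softmax(logits / tau, dim=-1)
        return -(p * p.clamp_min(1e-8).log()).sum(dim=-1)  # shape: [batch]
    ent_a, ent_b = entropy(logits_a), entropy(logits_b)
    # Lower entropy -> larger weight; softmax normalizes the pair to sum to 1.
    return F.softmax(torch.stack([-ent_a, -ent_b], dim=-1), dim=-1)  # [batch, 2]


def two_teacher_distill_loss(student_logits, zs_teacher_logits, prev_teacher_logits,
                             tau: float = 2.0):
    """KL distillation from both teachers, re-weighted per sample."""
    w = entropy_weights(zs_teacher_logits, prev_teacher_logits, tau)  # [batch, 2]
    log_p_student = F.log_softmax(student_logits / tau, dim=-1)
    kl_zs = F.kl_div(log_p_student, F.softmax(zs_teacher_logits / tau, dim=-1),
                     reduction="none").sum(dim=-1)
    kl_prev = F.kl_div(log_p_student, F.softmax(prev_teacher_logits / tau, dim=-1),
                       reduction="none").sum(dim=-1)
    # Standard tau^2 scaling keeps gradient magnitudes comparable across temperatures.
    return ((w[:, 0] * kl_zs + w[:, 1] * kl_prev) * tau * tau).mean()


if __name__ == "__main__":
    # Toy usage: random logits stand in for image-text similarity scores.
    b, c = 4, 10
    loss = two_teacher_distill_loss(torch.randn(b, c), torch.randn(b, c), torch.randn(b, c))
    print(float(loss))
```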