LLaVA-MoLE: Sparse Mixture of LoRA Experts for Mitigating Data Conflicts in Instruction Finetuning MLLMs
Format: Article
Language: English
Abstract: Instruction finetuning on a variety of image-text instruction data is the key to obtaining a versatile Multimodal Large Language Model (MLLM), and different configurations of the instruction data can lead to finetuned models with different capabilities. However, we have discovered that data conflicts are inevitable when mixing instruction data from distinct domains, which can result in performance drops for tasks of a specific domain. To address this issue, we propose to apply an efficient Mixture of Experts (MoE) design, namely a sparse Mixture of LoRA Experts (MoLE), for instruction finetuning MLLMs. Within the Transformer layers, we extend the popular Low-Rank Adaptation (LoRA) method by creating a set of LoRA experts specifically for the MLP layer, and route each token to the top-1 expert based on a routing function, allowing adaptive choices for tokens from different domains. Since the LoRA experts are sparsely activated, the training and inference costs are kept roughly constant compared to the original LoRA method. By replacing the plain LoRA of LLaVA-1.5 with our MoE design, our final model is named LLaVA-MoLE. Extensive experiments show that LLaVA-MoLE effectively mitigates the data conflict issue when mixing multiple distinct instruction datasets with various configurations, and achieves consistent performance gains over the strong plain-LoRA baselines. Most importantly, on the mixed datasets, LLaVA-MoLE can even outperform the plain-LoRA baseline trained with twice as many samples.
DOI: 10.48550/arxiv.2401.16160
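The mechanism described in the abstract (a set of LoRA experts attached to the MLP layer, with each token routed to its top-1 expert) can be illustrated with a short sketch. The code below is a minimal, assumption-laden illustration in PyTorch: the class name `MoLELinear`, the expert count, the rank, and the plain linear router are illustrative choices, not the authors' exact implementation, and training the router (argmax is non-differentiable) would require extra machinery that this sketch omits.

```python
import torch
import torch.nn as nn


class MoLELinear(nn.Module):
    """A frozen linear layer augmented with K LoRA experts and a top-1 token router."""

    def __init__(self, base: nn.Linear, num_experts: int = 3, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False          # the pretrained weight stays frozen
        self.scale = alpha / rank
        # One low-rank (A, B) pair per expert; B starts at zero as in standard LoRA,
        # so the module initially behaves exactly like the frozen base layer.
        self.lora_A = nn.ModuleList(
            nn.Linear(base.in_features, rank, bias=False) for _ in range(num_experts)
        )
        self.lora_B = nn.ModuleList(
            nn.Linear(rank, base.out_features, bias=False) for _ in range(num_experts)
        )
        for B in self.lora_B:
            nn.init.zeros_(B.weight)
        # Routing function: per-token logits over the experts.
        self.router = nn.Linear(base.in_features, num_experts, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, in_features)
        out = self.base(x)
        flat = x.reshape(-1, x.shape[-1])
        expert_idx = self.router(flat).argmax(dim=-1)      # top-1 expert per token
        delta = torch.zeros(flat.shape[0], self.base.out_features,
                            dtype=out.dtype, device=out.device)
        for k, (A, B) in enumerate(zip(self.lora_A, self.lora_B)):
            sel = (expert_idx == k).nonzero(as_tuple=True)[0]
            if sel.numel() == 0:
                continue
            # Only the tokens routed to expert k pass through its LoRA pair,
            # so the added compute per token stays that of a single LoRA.
            delta[sel] = self.scale * B(A(flat[sel]))
        return out + delta.reshape(out.shape)
```

In use, one would wrap, for example, an MLP projection in each Transformer block with such a module and train only the LoRA and router parameters, keeping the base MLLM weights frozen.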