Multimodal Large Language Model with LoRA Fine-Tuning for Multimodal Sentiment Analysis
| Published in: | ACM Transactions on Intelligent Systems and Technology, 2024-12 |
| --- | --- |
| Main authors: | , , , , |
| Format: | Article |
| Language: | English |
| Subjects: | |
| Online access: | Full text |
| Summary: | Multimodal sentiment analysis has become a popular research topic in recent years. However, existing methods have two unaddressed limitations: (1) they train models on limited supervised labels, which prevents the model from fully learning the sentiment expressed in the different modalities; (2) they employ text and image pre-trained models trained on separate unimodal tasks to extract the features of each modality, so the extracted features cannot capture the interactions between image and text. To solve these problems, this paper proposes a Vision-Language Contrastive Learning network (VLCLNet). First, we introduce a pre-trained Large Language Model (LLM) which, having been trained on vast quantities of multimodal data, understands image and text content better and can therefore be applied effectively to different tasks with only a small amount of labelled training data. Second, we adapt a Multimodal Large Language Model (MLLM), the BLIP-2 (Bootstrapping Language-Image Pre-training) network, to extract multimodal fusion features. Such an MLLM can fully account for the correlation between images and texts when extracting features. In addition, because of the discrepancy between the pre-training task and the sentiment analysis task, the pre-trained model produces suboptimal predictions. We use the LoRA (Low-Rank Adaptation) fine-tuning strategy to update the model parameters on the sentiment analysis task, which avoids the mismatch between the pre-training task and the downstream task. Experiments verify that the proposed VLCLNet is superior to other strong baselines. |
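The abstract names LoRA but gives no code. As a minimal sketch of the low-rank idea behind LoRA (not the authors' implementation, and independent of BLIP-2): the pre-trained weight W stays frozen, and only a small low-rank update scale · (B @ A) is trained, so the number of trainable parameters drops from out·in to rank·(out + in). The class name and hyperparameter defaults below are illustrative assumptions.

```python
import numpy as np

class LoRALinear:
    """A frozen linear layer W augmented with a trainable low-rank
    update scale * (B @ A), illustrating LoRA (Low-Rank Adaptation)."""

    def __init__(self, w, rank=4, alpha=8, seed=0):
        rng = np.random.default_rng(seed)
        self.w = w  # pre-trained weight, shape (out_dim, in_dim); kept frozen
        out_dim, in_dim = w.shape
        # A is small random noise and B is zero, so the update starts at
        # zero and the adapted layer initially matches the pre-trained one.
        self.a = rng.normal(0.0, 0.01, size=(rank, in_dim))  # trainable
        self.b = np.zeros((out_dim, rank))                   # trainable
        self.scale = alpha / rank

    def __call__(self, x):
        # y = x W^T + scale * (x A^T) B^T
        return x @ self.w.T + self.scale * (x @ self.a.T) @ self.b.T
```

Because B is zero-initialized, fine-tuning starts exactly from the pre-trained model's behavior; gradient updates to A and B then adapt it to the downstream sentiment task without touching W.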
| ISSN: | 2157-6904, 2157-6912 |
| DOI: | 10.1145/3709147 |