Multimodal Large Language Model with LoRA Fine-Tuning for Multimodal Sentiment Analysis

Bibliographic Details
Published in: ACM transactions on intelligent systems and technology, 2024-12
Main authors: Mu, Jie; Wang, Wei; Liu, Wenqi; Yan, Tiantian; Wang, Guanglu
Format: Article
Language: eng
Subjects: Computing methodologies; Natural language processing
Online access: Full text
container_title ACM transactions on intelligent systems and technology
creator Mu, Jie
Wang, Wei
Liu, Wenqi
Yan, Tiantian
Wang, Guanglu
description Multimodal sentiment analysis has become a popular research topic in recent years. However, existing methods have two unaddressed limitations: (1) they train models with limited supervised labels, which prevents the models from fully learning the sentiment expressed in the different modalities; (2) they employ text and image pre-trained models that were trained on separate unimodal tasks to extract modality-specific features, so the extracted features cannot capture the interaction between image and text. To solve these problems, we propose a Vision-Language Contrastive Learning network (VLCLNet). First, we introduce a pre-trained Large Language Model (LLM) that is trained on vast quantities of multimodal data and therefore understands image and text content better, so it can be applied effectively to different tasks while requiring only a small amount of labelled training data. Second, we adapt a Multimodal Large Language Model (MLLM), the BLIP-2 (Bootstrapping Language-Image Pre-training) network, to extract multimodal fusion features; such an MLLM can fully consider the correlation between images and texts when extracting features. In addition, because of the discrepancy between the pre-training task and the sentiment analysis task, the pre-trained model produces suboptimal predictions. We therefore use a LoRA (Low-Rank Adaptation) fine-tuning strategy to update the model parameters on the sentiment analysis task, which avoids the inconsistency between the pre-training task and the downstream task. Experiments verify that the proposed VLCLNet is superior to other strong baselines.
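To make the LoRA fine-tuning step described above concrete, the sketch below shows one way to attach low-rank adapters to a BLIP-2 backbone and train only those adapters plus a small sentiment head. It is a minimal illustration under stated assumptions, not the authors' VLCLNet implementation: the checkpoint name, the LoRA target_modules list, the three-class label set, and the mean-pooled Q-Former feature used for classification are all choices made for this example.

# Illustrative sketch only -- NOT the authors' VLCLNet code.
# Pattern from the abstract: take a pre-trained BLIP-2 backbone, freeze it,
# inject LoRA adapters, and train a small sentiment head on the fused
# (Q-Former) features. Checkpoint name, target_modules, and the 3-class
# label set are assumptions made for this example.
import torch
import torch.nn as nn
from transformers import Blip2Model
from peft import LoraConfig, get_peft_model


class Blip2LoraSentimentClassifier(nn.Module):
    """Hypothetical wrapper: BLIP-2 fusion features -> sentiment logits."""

    def __init__(self, model_name: str = "Salesforce/blip2-opt-2.7b",
                 num_labels: int = 3, lora_rank: int = 8):
        super().__init__()
        base = Blip2Model.from_pretrained(model_name)
        qformer_dim = base.config.qformer_config.hidden_size

        # LoRA: keep the pre-trained weights frozen and learn low-rank
        # updates W + BA only inside the attention projections named below.
        lora_cfg = LoraConfig(
            r=lora_rank,
            lora_alpha=2 * lora_rank,
            lora_dropout=0.05,
            target_modules=["q_proj", "v_proj"],  # assumed module names
        )
        self.backbone = get_peft_model(base, lora_cfg)
        self.backbone.print_trainable_parameters()  # only adapters are trainable

        # Lightweight task head trained from scratch for sentiment analysis.
        self.classifier = nn.Linear(qformer_dim, num_labels)

    def forward(self, pixel_values: torch.Tensor) -> torch.Tensor:
        # The Q-Former queries attend to the image and are what BLIP-2 feeds
        # to its language model together with the text; pooling them here is
        # a simplification that drops the text branch for brevity.
        query_out = self.backbone.get_qformer_features(pixel_values=pixel_values)
        fused = query_out.last_hidden_state.mean(dim=1)  # (batch, qformer_dim)
        return self.classifier(fused)


if __name__ == "__main__":
    # Hypothetical usage with random pixels standing in for a real image batch.
    model = Blip2LoraSentimentClassifier()
    dummy_pixels = torch.randn(2, 3, 224, 224)
    print(model(pixel_values=dummy_pixels).shape)  # torch.Size([2, 3])

In this arrangement only the LoRA adapters and the linear head receive gradients, which reflects the abstract's point that the pre-trained backbone can be adapted to the sentiment task with little labelled data.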
doi 10.1145/3709147
format Article
fulltext fulltext
identifier ISSN: 2157-6904
ispartof ACM transactions on intelligent systems and technology, 2024-12
issn 2157-6904
2157-6912
language eng
recordid cdi_crossref_primary_10_1145_3709147
source ACM Digital Library
subjects Computing methodologies
Natural language processing
title Multimodal Large Language Model with LoRA Fine-Tuning for Multimodal Sentiment Analysis