Multimodal Large Language Model with LoRA Fine-Tuning for Multimodal Sentiment Analysis

Bibliographic Details
Published in: ACM transactions on intelligent systems and technology, 2024-12
Main authors: Mu, Jie; Wang, Wei; Liu, Wenqi; Yan, Tiantian; Wang, Guanglu
Format: Article
Language: eng
Subjects: Computing methodologies; Natural language processing
Online access: Full text
container_title ACM transactions on intelligent systems and technology
creator Mu, Jie
Wang, Wei
Liu, Wenqi
Yan, Tiantian
Wang, Guanglu
description Multimodal sentiment analysis has become a popular research topic in recent years. However, existing methods have two unaddressed limitations: (1) they train models with limited supervised labels, which prevents the models from fully learning the sentiment expressed in the different modalities; (2) they employ text and image pre-trained models that were trained on separate unimodal tasks to extract modality-specific features, so the extracted features cannot capture the interaction between image and text. To solve these problems, we propose a Vision-Language Contrastive Learning network (VLCLNet). First, we introduce a pre-trained Large Language Model (LLM) that is trained on vast quantities of multimodal data and therefore understands image and text content better, so it can be applied effectively to different tasks while requiring only a small amount of labelled training data. Second, we adapt a Multimodal Large Language Model (MLLM), the BLIP-2 (Bootstrapping Language-Image Pre-training) network, to extract multimodal fusion features; such an MLLM can fully consider the correlation between images and texts when extracting features. In addition, because of the discrepancy between the pre-training task and the sentiment analysis task, the pre-trained model produces suboptimal predictions. We therefore use a LoRA (Low-Rank Adaptation) fine-tuning strategy to update the model parameters on the sentiment analysis task, which avoids the inconsistency between the pre-training task and the downstream task. Experiments verify that the proposed VLCLNet is superior to other strong baselines.
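To make the LoRA fine-tuning step described above concrete, the sketch below shows one way to attach low-rank adapters to a BLIP-2 backbone and train only those adapters plus a small sentiment head. It is a minimal illustration under stated assumptions, not the authors' VLCLNet implementation: the checkpoint name, the LoRA target_modules list, the three-class label set, and the mean-pooled Q-Former feature used for classification are all choices made for this example.

# Illustrative sketch only -- NOT the authors' VLCLNet code.
# Pattern from the abstract: take a pre-trained BLIP-2 backbone, freeze it,
# inject LoRA adapters, and train a small sentiment head on the fused
# (Q-Former) features. Checkpoint name, target_modules, and the 3-class
# label set are assumptions made for this example.
import torch
import torch.nn as nn
from transformers import Blip2Model
from peft import LoraConfig, get_peft_model


class Blip2LoraSentimentClassifier(nn.Module):
    """Hypothetical wrapper: BLIP-2 fusion features -> sentiment logits."""

    def __init__(self, model_name: str = "Salesforce/blip2-opt-2.7b",
                 num_labels: int = 3, lora_rank: int = 8):
        super().__init__()
        base = Blip2Model.from_pretrained(model_name)
        qformer_dim = base.config.qformer_config.hidden_size

        # LoRA: keep the pre-trained weights frozen and learn low-rank
        # updates W + BA only inside the attention projections named below.
        lora_cfg = LoraConfig(
            r=lora_rank,
            lora_alpha=2 * lora_rank,
            lora_dropout=0.05,
            target_modules=["q_proj", "v_proj"],  # assumed module names
        )
        self.backbone = get_peft_model(base, lora_cfg)
        self.backbone.print_trainable_parameters()  # only adapters are trainable

        # Lightweight task head trained from scratch for sentiment analysis.
        self.classifier = nn.Linear(qformer_dim, num_labels)

    def forward(self, pixel_values: torch.Tensor) -> torch.Tensor:
        # The Q-Former queries attend to the image and are what BLIP-2 feeds
        # to its language model together with the text; pooling them here is
        # a simplification that drops the text branch for brevity.
        query_out = self.backbone.get_qformer_features(pixel_values=pixel_values)
        fused = query_out.last_hidden_state.mean(dim=1)  # (batch, qformer_dim)
        return self.classifier(fused)


if __name__ == "__main__":
    # Hypothetical usage with random pixels standing in for a real image batch.
    model = Blip2LoraSentimentClassifier()
    dummy_pixels = torch.randn(2, 3, 224, 224)
    print(model(pixel_values=dummy_pixels).shape)  # torch.Size([2, 3])

In this arrangement only the LoRA adapters and the linear head receive gradients, which reflects the abstract's point that the pre-trained backbone can be adapted to the sentiment task with little labelled data.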
doi 10.1145/3709147
format Article
fulltext fulltext
identifier ISSN: 2157-6904
ispartof ACM transactions on intelligent systems and technology, 2024-12
issn 2157-6904
2157-6912
language eng
recordid cdi_crossref_primary_10_1145_3709147
source ACM Digital Library
subjects Computing methodologies
Natural language processing
title Multimodal Large Language Model with LoRA Fine-Tuning for Multimodal Sentiment Analysis