Dynamic Self-adaptive Multiscale Distillation from Pre-trained Multimodal Large Model for Efficient Cross-modal Representation Learning

In recent years, pre-trained multimodal large models have attracted widespread attention due to their outstanding performance in various multimodal applications. Nonetheless, the extensive computational resources and vast datasets required for their training present significant hurdles for deployment in environments with limited computational resources. To address this challenge, we propose, for the first time, dynamic self-adaptive multiscale distillation from a pre-trained multimodal large model for efficient cross-modal representation learning. Unlike existing distillation methods, our strategy employs a multiscale perspective, enabling the extraction of structural knowledge from the pre-trained multimodal large model and ensuring that the student model inherits a comprehensive and nuanced understanding of the teacher's knowledge. To optimize each distillation loss in a balanced and efficient manner, we propose a dynamic self-adaptive distillation loss balancer, a novel component that eliminates the need for manual loss-weight adjustment and dynamically balances each loss term during the distillation process. Our methodology streamlines pre-trained multimodal large models using only their output features and original image-level information, requiring minimal computational resources. This efficient approach is suited to various applications and allows the deployment of advanced multimodal technologies even in resource-limited settings. Extensive experiments demonstrate that our method maintains high performance while significantly reducing model complexity and training costs. Moreover, our distilled student model uses only image-level information to achieve state-of-the-art performance on cross-modal retrieval tasks, surpassing previous methods that relied on region-level information.
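The abstract describes the method only at a high level, so the snippet below is a minimal sketch of what multiscale output-feature distillation could look like: student and teacher token features are pooled to several scales and aligned per scale, yielding one loss term per scale for the balancer to weight. The frozen CLIP-like teacher, the chosen scales, the shared feature dimension, and the cosine-distance loss are all illustrative assumptions, not the paper's exact formulation.

```python
# Minimal sketch of multiscale output-feature distillation (illustrative assumptions:
# a frozen CLIP-like teacher, student features projected to the teacher's dimension,
# average pooling to a few scales, cosine distance per scale).
import torch
import torch.nn.functional as F

def multiscale_distill_losses(student_feats, teacher_feats, scales=(1, 2, 4)):
    """Return one distillation loss per scale.

    student_feats, teacher_feats: (batch, tokens, dim) output features,
    already projected to a common dimension.
    """
    losses = []
    for s in scales:
        # Pool the token sequence down to `s` segments per sample (coarse to fine).
        st = F.adaptive_avg_pool1d(student_feats.transpose(1, 2), s).transpose(1, 2)
        te = F.adaptive_avg_pool1d(teacher_feats.transpose(1, 2), s).transpose(1, 2)
        # Cosine distance between pooled student and teacher features at this scale.
        losses.append((1.0 - F.cosine_similarity(st, te, dim=-1)).mean())
    return losses

if __name__ == "__main__":
    student = torch.randn(8, 49, 256)   # hypothetical student output features
    teacher = torch.randn(8, 49, 256)   # frozen teacher output features
    print([round(l.item(), 4) for l in multiscale_distill_losses(student, teacher)])
```

Returning the per-scale terms separately, rather than summing them, lets a loss balancer like the one described in the abstract weight each term on its own.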

Bibliographic Details
Main authors: Liang, Zhengyang; Liang, Meiyu; Huang, Wei; Li, Yawen; Xue, Zhe
Format: Article
Language: eng
Subjects: Computer Science - Computation and Language; Computer Science - Computer Vision and Pattern Recognition; Computer Science - Multimedia
creator Liang, Zhengyang; Liang, Meiyu; Huang, Wei; Li, Yawen; Xue, Zhe
description In recent years, pre-trained multimodal large models have attracted widespread attention due to their outstanding performance in various multimodal applications. Nonetheless, the extensive computational resources and vast datasets required for their training present significant hurdles for deployment in environments with limited computational resources. To address this challenge, we propose, for the first time, dynamic self-adaptive multiscale distillation from a pre-trained multimodal large model for efficient cross-modal representation learning. Unlike existing distillation methods, our strategy employs a multiscale perspective, enabling the extraction of structural knowledge from the pre-trained multimodal large model and ensuring that the student model inherits a comprehensive and nuanced understanding of the teacher's knowledge. To optimize each distillation loss in a balanced and efficient manner, we propose a dynamic self-adaptive distillation loss balancer, a novel component that eliminates the need for manual loss-weight adjustment and dynamically balances each loss term during the distillation process. Our methodology streamlines pre-trained multimodal large models using only their output features and original image-level information, requiring minimal computational resources. This efficient approach is suited to various applications and allows the deployment of advanced multimodal technologies even in resource-limited settings. Extensive experiments demonstrate that our method maintains high performance while significantly reducing model complexity and training costs. Moreover, our distilled student model uses only image-level information to achieve state-of-the-art performance on cross-modal retrieval tasks, surpassing previous methods that relied on region-level information.
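The description names a dynamic self-adaptive distillation loss balancer but does not spell out its weighting rule. The sketch below assumes one plausible realization: each loss term is rescaled by the inverse of an exponential moving average of its own magnitude, so no term dominates and no manual weights are tuned. The EMA-based rule, the momentum value, and the class name are assumptions for illustration only.

```python
# Sketch of a dynamic loss balancer (assumed rule: weight each loss term by the
# inverse of the running magnitude of that term, tracked with an EMA).
import torch

class DynamicLossBalancer:
    def __init__(self, num_losses, momentum=0.99, eps=1e-8):
        self.ema = [None] * num_losses   # running magnitude of each loss term
        self.momentum = momentum
        self.eps = eps

    def __call__(self, losses):
        total = 0.0
        for i, loss in enumerate(losses):
            value = float(loss.detach().abs())
            # Track this term's magnitude without letting gradients flow through it.
            self.ema[i] = value if self.ema[i] is None else (
                self.momentum * self.ema[i] + (1.0 - self.momentum) * value)
            # Rescale the term so all losses contribute at a comparable magnitude.
            total = total + loss / (self.ema[i] + self.eps)
        return total / len(losses)

# Usage with per-scale distillation losses of very different magnitudes.
balancer = DynamicLossBalancer(num_losses=3)
losses = [torch.rand((), requires_grad=True) * w for w in (10.0, 1.0, 0.1)]
combined = balancer(losses)
combined.backward()
```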
doi_str_mv 10.48550/arxiv.2404.10838
format Article
fulltext fulltext_linktorsrc
identifier DOI: 10.48550/arxiv.2404.10838
language eng
recordid cdi_arxiv_primary_2404_10838
source arXiv.org
subjects Computer Science - Computation and Language
Computer Science - Computer Vision and Pattern Recognition
Computer Science - Multimedia
title Dynamic Self-adaptive Multiscale Distillation from Pre-trained Multimodal Large Model for Efficient Cross-modal Representation Learning