ConES: Concept Embedding Search for Parameter Efficient Tuning Large Vision Language Models

Large pre-trained vision-language models have shown great promise in transferring pre-acquired knowledge to various domains and downstream tasks with appropriate prompting or tuning. Existing prevalent tuning methods generally fall into three genres: 1) prompt engineering, which creates suitable prompt texts but is time-consuming and requires domain expertise; 2) fine-tuning the whole model, which is extremely inefficient; and 3) prompt tuning, which learns parameterized prompt embeddings passed through the text encoder. Nevertheless, all of these methods rely on the text encoder to bridge the modality gap between vision and language. In this work, we question whether the cumbersome text encoder is necessary, seeking a more lightweight and efficient tuning paradigm as well as more representative prompt embeddings that lie closer to the image representations. To this end, we propose a Concept Embedding Search (ConES) approach that optimizes prompt embeddings directly, without the text encoder, to capture the 'concept' of the image modality through a variety of task objectives. By dropping the text encoder, we significantly speed up the learning process, e.g., from about an hour to roughly ten minutes in our experiments on personalized text-to-image generation, without impairing generation quality. Moreover, the proposed approach is orthogonal to existing tuning methods, since the searched concept embeddings can be further used in a subsequent stage of fine-tuning the pre-trained large model to boost performance. Extensive experiments show that our approach outperforms prompt tuning and textual inversion methods on a variety of downstream tasks, including object detection, instance segmentation, and image generation. Our approach also shows better generalization to unseen concepts in specialized domains, such as the medical domain.
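
To make the core idea concrete, the following is a minimal, hypothetical PyTorch sketch of text-encoder-free prompt tuning in the spirit of ConES: the concept (prompt) embeddings are treated as free parameters and optimized directly against a downstream task loss while the pre-trained model stays frozen. This is not the authors' implementation; the stand-in model (FrozenVLHead), the dimensions, and the classification-style loss are illustrative placeholders for the detection, segmentation, and generation objectives used in the paper.

import torch
import torch.nn as nn

embed_dim, num_concepts = 512, 4

# Stand-in for a frozen pre-trained vision-language head that fuses image
# features with (normally text-encoder-produced) prompt embeddings.
class FrozenVLHead(nn.Module):
    def __init__(self):
        super().__init__()
        self.image_proj = nn.Linear(2048, embed_dim)

    def forward(self, image_feats, concept_embeds):
        img = self.image_proj(image_feats)      # (B, embed_dim)
        # Score each image against each concept embedding.
        return img @ concept_embeds.t()         # (B, num_concepts)

model = FrozenVLHead()
for p in model.parameters():
    p.requires_grad_(False)                     # the pre-trained backbone stays frozen

# The only trainable parameters: the concept (prompt) embeddings themselves.
concept_embeds = nn.Parameter(torch.randn(num_concepts, embed_dim) * 0.02)
optimizer = torch.optim.AdamW([concept_embeds], lr=1e-3)

# Toy optimization loop; random tensors stand in for image features and labels.
for step in range(100):
    image_feats = torch.randn(8, 2048)
    targets = torch.randint(0, num_concepts, (8,))
    logits = model(image_feats, concept_embeds)
    loss = nn.functional.cross_entropy(logits, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

Because only the embedding matrix receives gradients, the trainable parameter count is tiny compared with fine-tuning the whole model, which is consistent with the parameter-efficient goal of the approach.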

Bibliographic Details
Main authors: Yi, Huahui; Qin, Ziyuan; Xu, Wei; Guo, Miaotian; Wang, Kun; Zhang, Shaoting; Li, Kang; Lao, Qicheng
Format: Article
Language: English
Subjects: Computer Science - Computer Vision and Pattern Recognition
Online access: https://arxiv.org/abs/2305.18993
DOI: 10.48550/arxiv.2305.18993
Source: arXiv.org