Parameter-efficient tuning of cross-modal retrieval for a specific database via trainable textual and visual prompts
Published in: International journal of multimedia information retrieval, 2024-03, Vol. 13(1), p. 14, Article 14
Main authors: Zhang, Huaying; Yanagi, Rintaro; Togo, Ren; Ogawa, Takahiro; Haseyama, Miki
Format: Article
Language: English
Online access: Full text
Abstract: A novel cross-modal image retrieval method realized by parameter-efficiently tuning a pre-trained cross-modal model is proposed in this study. Conventional cross-modal retrieval methods realize text-to-image retrieval by training cross-modal models to bring paired texts and images close in a common embedding space. However, these methods are trained on huge amounts of manually annotated image-text pairs, which may be unavailable for specific databases. To reduce the dependency on the amount and quality of training data, fine-tuning a pre-trained model is one approach to improving retrieval accuracy on specific personal image databases. However, this approach is parameter-inefficient, as separate models must be trained and retained for different databases. Thus, we propose a cross-modal retrieval method that uses prompt learning to solve these problems. The proposed method constructs two types of prompts, a textual prompt and a visual prompt, both of which are multi-dimensional vectors. The textual and visual prompts are then concatenated with the input texts and images, respectively. By optimizing the prompts to bring paired texts and images close in the common embedding space, the proposed method can improve retrieval accuracy while updating only a few parameters. The experimental results demonstrate that the proposed method is effective for improving retrieval accuracy and outperforms conventional methods in terms of parameter efficiency.
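The approach described in the abstract can be illustrated with a minimal sketch (not the authors' released code): trainable textual and visual prompt vectors are concatenated with the token and patch embeddings of a frozen pre-trained dual encoder, and only the prompts are optimized with a contrastive loss that pulls paired texts and images together in the common embedding space. The FrozenEncoder stand-in, the prompt lengths, the embedding dimension, and the dummy inputs below are illustrative assumptions; a real setup would use a pre-trained cross-modal model such as CLIP.

```python
# Minimal sketch of database-specific prompt tuning for cross-modal retrieval.
# Only the prompt vectors are trainable; the pre-trained encoders stay frozen.
import torch
import torch.nn as nn
import torch.nn.functional as F


class FrozenEncoder(nn.Module):
    """Stand-in for one branch (text or image) of a pre-trained dual encoder."""

    def __init__(self, dim: int = 512, depth: int = 2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=depth)
        self.proj = nn.Linear(dim, dim)
        for p in self.parameters():          # keep the "pre-trained" weights fixed
            p.requires_grad_(False)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, seq_len, dim) token or patch embeddings
        h = self.backbone(tokens)
        return F.normalize(self.proj(h.mean(dim=1)), dim=-1)


class PromptTunedRetrieval(nn.Module):
    def __init__(self, dim: int = 512, n_text_prompts: int = 8, n_visual_prompts: int = 8):
        super().__init__()
        self.text_encoder = FrozenEncoder(dim)
        self.image_encoder = FrozenEncoder(dim)
        # The only trainable parameters: database-specific prompt vectors.
        self.text_prompt = nn.Parameter(torch.randn(n_text_prompts, dim) * 0.02)
        self.visual_prompt = nn.Parameter(torch.randn(n_visual_prompts, dim) * 0.02)

    def encode_text(self, text_tokens: torch.Tensor) -> torch.Tensor:
        prompt = self.text_prompt.unsqueeze(0).expand(text_tokens.size(0), -1, -1)
        return self.text_encoder(torch.cat([prompt, text_tokens], dim=1))

    def encode_image(self, patch_tokens: torch.Tensor) -> torch.Tensor:
        prompt = self.visual_prompt.unsqueeze(0).expand(patch_tokens.size(0), -1, -1)
        return self.image_encoder(torch.cat([prompt, patch_tokens], dim=1))

    def contrastive_loss(self, text_tokens, patch_tokens, temperature: float = 0.07):
        t = self.encode_text(text_tokens)
        v = self.encode_image(patch_tokens)
        logits = t @ v.t() / temperature                     # (batch, batch) similarities
        targets = torch.arange(t.size(0), device=logits.device)  # i-th text pairs with i-th image
        return (F.cross_entropy(logits, targets) +
                F.cross_entropy(logits.t(), targets)) / 2


if __name__ == "__main__":
    model = PromptTunedRetrieval()
    optimizer = torch.optim.AdamW(
        [model.text_prompt, model.visual_prompt], lr=1e-3)   # update prompts only
    text_tokens = torch.randn(4, 16, 512)    # dummy token embeddings for 4 captions
    patch_tokens = torch.randn(4, 49, 512)   # dummy patch embeddings for 4 images
    loss = model.contrastive_loss(text_tokens, patch_tokens)
    loss.backward()
    optimizer.step()
    print(f"contrastive loss: {loss.item():.4f}")
```

Because only the two small prompt tensors receive gradients, a separate prompt pair can be stored per target database while the large pre-trained encoders are shared, which is the parameter-efficiency argument made in the abstract.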
DOI: 10.1007/s13735-024-00322-y
ISSN: 2192-6611
EISSN: 2192-662X
Publisher: Springer London
Subjects: Accuracy; Computer Science; Data Mining and Knowledge Discovery; Database Management; Datasets; Embedding; Image Processing and Computer Vision; Image retrieval; Information Storage and Retrieval; Information Systems Applications (incl. Internet); Mathematical models; Methods; Multimedia Information Systems; Neural networks; Parameters; Regular Paper; Retrieval; Semantics; Texts; Training