Parameter-efficient tuning of cross-modal retrieval for a specific database via trainable textual and visual prompts

Bibliographic details
Published in: International journal of multimedia information retrieval, 2024-03, Vol. 13 (1), p. 14, Article 14
Authors: Zhang, Huaying; Yanagi, Rintaro; Togo, Ren; Ogawa, Takahiro; Haseyama, Miki
Format: Article
Language: English
Online access: Full text
Abstract: A novel cross-modal image retrieval method, realized by parameter-efficient tuning of a pre-trained cross-modal model, is proposed in this study. Conventional cross-modal retrieval methods achieve text-to-image retrieval by training cross-modal models to bring paired texts and images close together in a common embedding space. However, these methods are trained on huge amounts of intentionally annotated image-text pairs, which may be unavailable for specific databases. To reduce the dependency on the amount and quality of training data, fine-tuning a pre-trained model is one approach to improving retrieval accuracy on specific personal image databases. However, this approach is parameter-inefficient, because a separate model must be trained and retained for each database. Thus, we propose a cross-modal retrieval method that uses prompt learning to solve these problems. The proposed method constructs two types of prompts, a textual prompt and a visual prompt, both of which are multi-dimensional vectors. The textual and visual prompts are concatenated with the input texts and images, respectively. By optimizing the prompts to bring paired texts and images close in the common embedding space, the proposed method improves retrieval accuracy while updating only a few parameters. The experimental results demonstrate that the proposed method is effective for improving retrieval accuracy and outperforms conventional methods in terms of parameter efficiency.
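The prompt-tuning scheme summarized in the abstract can be illustrated with a short PyTorch sketch. This is a minimal, hypothetical example rather than the authors' implementation: the FrozenEncoder stand-in, the prompt length, the token and embedding dimensions, and the symmetric contrastive loss are assumptions chosen for illustration, on the premise of a CLIP-style pair of frozen text and image encoders.

```python
# Minimal sketch of prompt tuning for cross-modal retrieval (illustrative,
# not the paper's code): the pre-trained encoders stay frozen and only the
# textual and visual prompt vectors are optimized.
import torch
import torch.nn as nn
import torch.nn.functional as F

EMBED_DIM = 512    # dimensionality of the common embedding space (assumed)
TOKEN_DIM = 512    # token dimensionality expected by the encoders (assumed)
PROMPT_LEN = 8     # number of learnable prompt tokens per modality (assumed)


class FrozenEncoder(nn.Module):
    """Placeholder for a frozen pre-trained encoder (e.g. one CLIP tower)."""

    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(TOKEN_DIM, EMBED_DIM)
        for p in self.parameters():           # freeze every encoder weight
            p.requires_grad_(False)

    def forward(self, tokens):                # tokens: (B, L, TOKEN_DIM)
        return self.proj(tokens.mean(dim=1))  # pool tokens -> (B, EMBED_DIM)


class PromptTuner(nn.Module):
    """Trainable textual and visual prompts concatenated with the inputs."""

    def __init__(self, text_enc, image_enc):
        super().__init__()
        self.text_enc, self.image_enc = text_enc, image_enc
        self.text_prompt = nn.Parameter(0.02 * torch.randn(PROMPT_LEN, TOKEN_DIM))
        self.visual_prompt = nn.Parameter(0.02 * torch.randn(PROMPT_LEN, TOKEN_DIM))

    def forward(self, text_tokens, image_tokens):
        b = text_tokens.size(0)
        # Prepend the prompt vectors to the token sequence of each modality.
        t = torch.cat([self.text_prompt.expand(b, -1, -1), text_tokens], dim=1)
        v = torch.cat([self.visual_prompt.expand(b, -1, -1), image_tokens], dim=1)
        t_emb = F.normalize(self.text_enc(t), dim=-1)
        v_emb = F.normalize(self.image_enc(v), dim=-1)
        return t_emb, v_emb


def contrastive_loss(t_emb, v_emb, temperature=0.07):
    """Symmetric InfoNCE-style loss pulling paired texts and images together."""
    logits = t_emb @ v_emb.t() / temperature
    targets = torch.arange(logits.size(0))
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))


model = PromptTuner(FrozenEncoder(), FrozenEncoder())
# Only the two small prompt tensors receive gradient updates.
optimizer = torch.optim.AdamW([model.text_prompt, model.visual_prompt], lr=1e-3)

text_tokens = torch.randn(4, 16, TOKEN_DIM)   # dummy tokenized captions
image_tokens = torch.randn(4, 49, TOKEN_DIM)  # dummy image patch tokens
t_emb, v_emb = model(text_tokens, image_tokens)
loss = contrastive_loss(t_emb, v_emb)
loss.backward()
optimizer.step()
```

Because only the prompts are updated, the same frozen cross-modal model can be shared across databases, with each database keeping its own small pair of prompt vectors; this is the sense in which the approach is parameter-efficient.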
DOI: 10.1007/s13735-024-00322-y
ISSN: 2192-6611
EISSN: 2192-662X
Source: SpringerLink Journals - AutoHoldings
Subjects:
Accuracy
Computer Science
Data Mining and Knowledge Discovery
Database Management
Datasets
Embedding
Image Processing and Computer Vision
Image retrieval
Information Storage and Retrieval
Information Systems Applications (incl.Internet)
Mathematical models
Methods
Multimedia Information Systems
Neural networks
Parameters
Regular Paper
Retrieval
Semantics
Texts
Training