Parameter-efficient tuning of cross-modal retrieval for a specific database via trainable textual and visual prompts
Published in: International journal of multimedia information retrieval, 2024-03, Vol. 13(1), p. 14, Article 14
Main authors: Zhang, Huaying; Yanagi, Rintaro; Togo, Ren; Ogawa, Takahiro; Haseyama, Miki
Format: Article
Language: English
Online access: Full text
Abstract: A novel cross-modal image retrieval method realized by parameter-efficiently tuning a pre-trained cross-modal model is proposed in this study. Conventional cross-modal retrieval methods realize text-to-image retrieval by training cross-modal models to bring paired texts and images close in a common embedding space. However, these methods are trained on huge amounts of manually annotated image-text pairs, which may be unavailable for specific databases. To reduce the dependency on the amount and quality of training data, fine-tuning a pre-trained model is one approach to improving retrieval accuracy on specific personal image databases. However, this approach is parameter-inefficient, as separate models must be trained and retained for different databases. Thus, we propose a cross-modal retrieval method that uses prompt learning to solve these problems. The proposed method constructs two types of prompts, a textual prompt and a visual prompt, both of which are multi-dimensional vectors. The textual and visual prompts are then concatenated with the input texts and images, respectively. By optimizing the prompts to bring paired texts and images close in the common embedding space, the proposed method can improve retrieval accuracy while updating only a few parameters. The experimental results demonstrate that the proposed method is effective for improving retrieval accuracy and outperforms conventional methods in terms of parameter efficiency.
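The approach described in the abstract can be illustrated with a minimal sketch (not the authors' released code): trainable textual and visual prompt vectors are concatenated with the token and patch embeddings of a frozen pre-trained dual encoder, and only the prompts are optimized with a contrastive loss that pulls paired texts and images together in the common embedding space. The FrozenEncoder stand-in, the prompt lengths, the embedding dimension, and the dummy inputs below are illustrative assumptions; a real setup would use a pre-trained cross-modal model such as CLIP.

```python
# Minimal sketch of database-specific prompt tuning for cross-modal retrieval.
# Only the prompt vectors are trainable; the pre-trained encoders stay frozen.
import torch
import torch.nn as nn
import torch.nn.functional as F


class FrozenEncoder(nn.Module):
    """Stand-in for one branch (text or image) of a pre-trained dual encoder."""

    def __init__(self, dim: int = 512, depth: int = 2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=depth)
        self.proj = nn.Linear(dim, dim)
        for p in self.parameters():          # keep the "pre-trained" weights fixed
            p.requires_grad_(False)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, seq_len, dim) token or patch embeddings
        h = self.backbone(tokens)
        return F.normalize(self.proj(h.mean(dim=1)), dim=-1)


class PromptTunedRetrieval(nn.Module):
    def __init__(self, dim: int = 512, n_text_prompts: int = 8, n_visual_prompts: int = 8):
        super().__init__()
        self.text_encoder = FrozenEncoder(dim)
        self.image_encoder = FrozenEncoder(dim)
        # The only trainable parameters: database-specific prompt vectors.
        self.text_prompt = nn.Parameter(torch.randn(n_text_prompts, dim) * 0.02)
        self.visual_prompt = nn.Parameter(torch.randn(n_visual_prompts, dim) * 0.02)

    def encode_text(self, text_tokens: torch.Tensor) -> torch.Tensor:
        prompt = self.text_prompt.unsqueeze(0).expand(text_tokens.size(0), -1, -1)
        return self.text_encoder(torch.cat([prompt, text_tokens], dim=1))

    def encode_image(self, patch_tokens: torch.Tensor) -> torch.Tensor:
        prompt = self.visual_prompt.unsqueeze(0).expand(patch_tokens.size(0), -1, -1)
        return self.image_encoder(torch.cat([prompt, patch_tokens], dim=1))

    def contrastive_loss(self, text_tokens, patch_tokens, temperature: float = 0.07):
        t = self.encode_text(text_tokens)
        v = self.encode_image(patch_tokens)
        logits = t @ v.t() / temperature                     # (batch, batch) similarities
        targets = torch.arange(t.size(0), device=logits.device)  # i-th text pairs with i-th image
        return (F.cross_entropy(logits, targets) +
                F.cross_entropy(logits.t(), targets)) / 2


if __name__ == "__main__":
    model = PromptTunedRetrieval()
    optimizer = torch.optim.AdamW(
        [model.text_prompt, model.visual_prompt], lr=1e-3)   # update prompts only
    text_tokens = torch.randn(4, 16, 512)    # dummy token embeddings for 4 captions
    patch_tokens = torch.randn(4, 49, 512)   # dummy patch embeddings for 4 images
    loss = model.contrastive_loss(text_tokens, patch_tokens)
    loss.backward()
    optimizer.step()
    print(f"contrastive loss: {loss.item():.4f}")
```

Because only the two small prompt tensors receive gradients, a separate prompt pair can be stored per target database while the large pre-trained encoders are shared, which is the parameter-efficiency argument made in the abstract.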
DOI: 10.1007/s13735-024-00322-y
ISSN: 2192-6611
EISSN: 2192-662X
Publisher: Springer London
Subjects: Accuracy; Computer Science; Data Mining and Knowledge Discovery; Database Management; Datasets; Embedding; Image Processing and Computer Vision; Image retrieval; Information Storage and Retrieval; Information Systems Applications (incl. Internet); Mathematical models; Methods; Multimedia Information Systems; Neural networks; Parameters; Regular Paper; Retrieval; Semantics; Texts; Training