Reliable Phrase Feature Mining for Hierarchical Video-Text Retrieval

Video-Text Retrieval is a fundamental task in multi-modal understanding and has attracted increasing attention from both academia and industry in recent years. Generally, video inherently contains multi-grained semantics, and each video corresponds to several different texts, which makes the task challenging. Previous best-performing methods adopt video-sentence, phrase-phrase, and frame-word interactions simultaneously. Unlike word/frame features, which can be obtained directly, phrase features must be adaptively aggregated from correlated word/frame features, which makes them demanding to construct. However, existing methods use simple intra-modal self-attention to generate phrase features without considering three aspects: cross-modality semantic correlation, phrase generation noise, and phrase diversity. In this paper, we propose a novel Reliable Phrase Mining model (RPM) to construct reliable phrase features and conduct hierarchical cross-modal interactions for video-text retrieval. The proposed RPM model enjoys several merits. First, to guarantee semantic consistency between video phrases and text phrases, we propose a set of modality-shared prototypes as the joint query to aggregate semantically related frame/word features into adaptive-grained phrase features. Second, to deal with phrase generation noise, the proposed denoised decoder module is responsible for obtaining a more reliable similarity between prototypes and frame/word features. Specifically, not only the correlation between frame/word features and prototypes, but also the correlation among prototypes, should be taken into account when calculating the similarity. Furthermore, to encourage different prototypes to focus on different semantic information, we design a prototype contrastive loss whose core idea is that phrases produced by the same prototype should be more similar than those produced by different prototypes. Extensive experimental results demonstrate that the proposed method performs favorably on three benchmark datasets: MSR-VTT, MSVD, and LSMDC.
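
The abstract describes two mechanisms that lend themselves to a compact illustration: modality-shared prototypes used as a joint query to aggregate frame/word features into phrase features, and a prototype contrastive loss that keeps phrases produced by the same prototype closer together than phrases produced by different prototypes. The sketch below is a minimal PyTorch interpretation of those two ideas only (it omits the denoised decoder); the module, function, and parameter names (PrototypePhraseAggregator, prototype_contrastive_loss, num_prototypes, temperature) are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (not the authors' code): modality-shared prototypes act as
# cross-attention queries that aggregate frame/word features into phrase
# features, plus a contrastive loss that pulls phrases produced by the same
# prototype (across the two modalities) together and pushes apart phrases
# produced by different prototypes. All names/hyperparameters are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class PrototypePhraseAggregator(nn.Module):
    """Aggregate frame or word features into phrase features via shared prototypes."""

    def __init__(self, dim: int = 512, num_prototypes: int = 8, num_heads: int = 8):
        super().__init__()
        # Modality-shared prototypes: the same learned queries attend over video
        # frames and over text words, tying the resulting phrases semantically.
        self.prototypes = nn.Parameter(torch.randn(num_prototypes, dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, token_feats: torch.Tensor, token_mask: torch.Tensor) -> torch.Tensor:
        # token_feats: (B, L, D) frame or word features; token_mask: (B, L), True = padding.
        B = token_feats.size(0)
        queries = self.prototypes.unsqueeze(0).expand(B, -1, -1)            # (B, P, D)
        phrases, _ = self.cross_attn(queries, token_feats, token_feats,
                                     key_padding_mask=token_mask)            # (B, P, D)
        return self.norm(phrases + queries)


def prototype_contrastive_loss(video_phrases: torch.Tensor,
                               text_phrases: torch.Tensor,
                               temperature: float = 0.07) -> torch.Tensor:
    """One interpretation of the loss: within each video-text pair, a video phrase
    should match the text phrase produced by the same prototype, not the others."""
    v = F.normalize(video_phrases, dim=-1)                                   # (B, P, D)
    t = F.normalize(text_phrases, dim=-1)                                    # (B, P, D)
    B, P, _ = v.shape
    logits = torch.einsum('bpd,bqd->bpq', v, t) / temperature                # (B, P, P)
    target = torch.arange(P, device=v.device).expand(B, P)                   # same-prototype index
    loss_v2t = F.cross_entropy(logits.reshape(B * P, P), target.reshape(-1))
    loss_t2v = F.cross_entropy(logits.transpose(1, 2).reshape(B * P, P), target.reshape(-1))
    return 0.5 * (loss_v2t + loss_t2v)


if __name__ == "__main__":
    agg = PrototypePhraseAggregator()
    frames = torch.randn(2, 12, 512)                      # 12 frame features per video
    words = torch.randn(2, 20, 512)                       # 20 word features per sentence
    frame_pad = torch.zeros(2, 12, dtype=torch.bool)      # no padding in this toy example
    word_pad = torch.zeros(2, 20, dtype=torch.bool)
    video_phrases = agg(frames, frame_pad)
    text_phrases = agg(words, word_pad)
    print(prototype_contrastive_loss(video_phrases, text_phrases))
```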

Bibliographic details
Published in: IEEE Transactions on Circuits and Systems for Video Technology, 2024-11, Vol. 34 (11), pp. 12019-12031
Main authors: Lai, Huakai; Yang, Wenfei; Zhang, Tianzhu; Zhang, Yongdong
Format: Article
Language: English
Subjects: attention work; Correlation; denoised decoder; Feature extraction; Noise generation; Prototypes; Reliability; reliable phrase mining; Retrieval; Semantics; Similarity; Task analysis; Training; Video-text retrieval; Words (language)
DOI: 10.1109/TCSVT.2024.3422869
Publisher: IEEE, New York
ISSN: 1051-8215
EISSN: 1558-2205
Source: IEEE Electronic Library (IEL)