Modal-specific Pseudo Query Generation for Video Corpus Moment Retrieval

Video corpus moment retrieval (VCMR) is the task of retrieving the most relevant video moment from a large video corpus using a natural language query. For narrative videos, e.g., dramas or movies, holistic understanding of temporal dynamics and multimodal reasoning is crucial. Previous works have shown promising results; however, they relied on expensive query annotations for VCMR, i.e., the corresponding moment intervals. To overcome this problem, we propose a self-supervised learning framework: the Modal-specific Pseudo Query Generation Network (MPGN). First, MPGN selects candidate temporal moments via subtitle-based moment sampling. Then, it generates pseudo queries that exploit both the visual and textual information of the selected moments. Through the multimodal information in the pseudo queries, we show that MPGN successfully learns to localize moments in the video corpus without any explicit annotation. We validate the effectiveness of MPGN on the TVR dataset, showing competitive results against both supervised models and models trained in unsupervised settings.
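
To make the two-stage recipe in the abstract concrete, here is a minimal Python sketch of the pipeline it describes: candidate moments are carved out of subtitle timestamps, and each moment's subtitle text is combined with a stand-in for its visual information to form a pseudo query. This is an illustrative assumption, not the authors' implementation; the names (Subtitle, sample_moments_from_subtitles, make_pseudo_query, the [VIS] separator, and visual_tags as a placeholder for frame-level visual descriptors) are all invented here.

```python
# Hypothetical sketch of subtitle-based moment sampling followed by
# pseudo query generation, loosely following the abstract's description.
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Subtitle:
    start: float  # seconds
    end: float    # seconds
    text: str


def sample_moments_from_subtitles(
    subs: List[Subtitle], max_span: int = 2
) -> List[Tuple[float, float, List[Subtitle]]]:
    """Treat spans of up to max_span consecutive subtitles as candidate moments."""
    moments = []
    for i in range(len(subs)):
        for span in range(1, max_span + 1):
            chunk = subs[i:i + span]
            if len(chunk) < span:
                break
            moments.append((chunk[0].start, chunk[-1].end, chunk))
    return moments


def make_pseudo_query(chunk: List[Subtitle], visual_tags: List[str]) -> str:
    """Fuse textual (subtitle) and visual (placeholder tags) information."""
    textual = " ".join(s.text for s in chunk)
    visual = ", ".join(visual_tags)  # stand-in for detected objects/actions
    return f"{textual} [VIS] {visual}"


if __name__ == "__main__":
    subs = [
        Subtitle(10.0, 12.5, "Where did you put the keys?"),
        Subtitle(12.5, 15.0, "They're on the kitchen table."),
    ]
    for start, end, chunk in sample_moments_from_subtitles(subs):
        q = make_pseudo_query(chunk, visual_tags=["two people", "kitchen"])
        print(f"[{start:.1f}s - {end:.1f}s] pseudo query: {q}")
```

In the paper, the visual side presumably comes from the video frames of the sampled interval; visual_tags here merely marks where that signal would enter. The (moment, pseudo query) pairs produced this way are what would replace manually annotated queries during training.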

Bibliographic details

Main authors: Jung, Minjoon; Choi, Seongho; Kim, Joochan; Kim, Jin-Hwa; Zhang, Byoung-Tak
Format: Article
Language: English
Subjects: Computer Science - Computation and Language
DOI: 10.48550/arXiv.2210.12617
Published: 2022-10-23
Source: arXiv.org
Online access: https://arxiv.org/abs/2210.12617