Modal-specific Pseudo Query Generation for Video Corpus Moment Retrieval
Video corpus moment retrieval (VCMR) is the task of retrieving the most relevant video moment from a large video corpus using a natural language query. For narrative videos, e.g., dramas or movies, a holistic understanding of temporal dynamics and multimodal reasoning is crucial. Previous works have shown promising results; however, they relied on expensive query annotations for VCMR, i.e., the corresponding moment intervals. To overcome this problem, we propose a self-supervised learning framework: the Modal-specific Pseudo Query Generation Network (MPGN). First, MPGN selects candidate temporal moments via subtitle-based moment sampling. Then, it generates pseudo queries by exploiting both visual and textual information from the selected temporal moments. Using the multimodal information in these pseudo queries, we show that MPGN successfully learns to localize video corpus moments without any explicit annotation. We validate the effectiveness of MPGN on the TVR dataset, showing competitive results compared with both supervised models and models trained in unsupervised settings.
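As a rough illustration of the two steps the abstract describes, the sketch below shows how subtitle-based moment sampling and multimodal pseudo-query generation could be wired together. All names, data structures, the gap-based merging heuristic, and the string-concatenation fusion are assumptions for illustration only; this record does not specify the paper's actual implementation, which presumably uses learned captioning and query-generation models rather than these stand-ins.

```python
# Minimal sketch of MPGN's two pseudo-labeling steps as summarized in the
# abstract. Everything below is a hypothetical illustration, not the
# paper's actual method.
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Subtitle:
    start: float  # start time in seconds
    end: float    # end time in seconds
    text: str


def sample_moments(subs: List[Subtitle],
                   max_gap: float = 1.5) -> List[Tuple[float, float, str]]:
    """Step 1 (assumed heuristic): merge temporally adjacent subtitle lines
    into candidate moments, starting a new moment whenever the silence
    between consecutive lines exceeds max_gap seconds."""
    if not subs:
        return []
    groups, current = [], [subs[0]]
    for prev, nxt in zip(subs, subs[1:]):
        if nxt.start - prev.end <= max_gap:
            current.append(nxt)
        else:
            groups.append(current)
            current = [nxt]
    groups.append(current)
    return [(g[0].start, g[-1].end, " ".join(s.text for s in g))
            for g in groups]


def generate_pseudo_query(subtitle_text: str, visual_caption: str) -> str:
    """Step 2 (assumed): fuse the modal-specific signals -- the moment's
    subtitle text and a caption from some visual captioning model -- into a
    single pseudo query string usable as a training target."""
    return f"{visual_caption.strip()} {subtitle_text.strip()}"


# Hypothetical usage; the caption string stands in for the output of any
# visual captioning model, which is not part of this record.
subs = [
    Subtitle(0.0, 2.1, "Where were you last night?"),
    Subtitle(2.4, 4.0, "I was at the hospital."),
    Subtitle(9.0, 11.5, "Let's talk about the case."),
]
for start, end, text in sample_moments(subs):
    query = generate_pseudo_query(text, "Two people argue in a kitchen.")
    print(f"[{start:.1f}s-{end:.1f}s] {query}")
```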
| Field | Value |
|---|---|
| Creators | Jung, Minjoon; Choi, Seongho; Kim, Joochan; Kim, Jin-Hwa; Zhang, Byoung-Tak |
| DOI | 10.48550/arxiv.2210.12617 |
| Format | Article |
| Date | 2022-10-23 |
| Rights | http://creativecommons.org/licenses/by-nc-sa/4.0 |
| Link | https://arxiv.org/abs/2210.12617 |
| Language | English |
| Source | arXiv.org |
| Subjects | Computer Science - Computation and Language |