Modality-Balanced Embedding for Video Retrieval

SIGIR, 2022. Video search has become the main way for users to discover videos relevant to a text query on large short-video sharing platforms. While training a query-video bi-encoder model on online search logs, we identify a modality bias phenomenon: the video encoder relies almost entirely on text matching and neglects the videos' other modalities, such as vision and audio. This modality imbalance results from a) the modality gap: the relevance between a query and a video's text is much easier to learn, because the query is itself text and shares the video text's modality; and b) data bias: most training samples can be solved by text matching alone. Here we share our practices for improving the first retrieval stage, including our solution to the modality imbalance issue. We propose MBVR (short for Modality-Balanced Video Retrieval) with two key components: manually generated modality-shuffled (MS) samples and a dynamic margin (DM) based on visual relevance, which together encourage the video encoder to pay balanced attention to each modality. Through extensive experiments on a real-world dataset, we show empirically that our method is both effective and efficient at solving the modality bias problem. We have also deployed MBVR on a large video platform and observed a statistically significant improvement over a highly optimized baseline in an A/B test and manual GSB evaluations.
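The abstract describes countering modality bias in a query-video bi-encoder with modality-shuffled (MS) samples: videos whose text is kept but whose visual features come from a different video, so that a retrieval model matching on text alone is penalized. The sketch below is a minimal, hypothetical illustration of that idea in PyTorch; the fusion function, tensor shapes, and the way MS samples enter the contrastive loss are assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch: in-batch contrastive loss with modality-shuffled (MS)
# negatives for a query-video bi-encoder. Not the authors' code.
import torch
import torch.nn.functional as F


def ms_contrastive_loss(q_emb, text_emb, vis_emb, fuse, temperature=0.07):
    """q_emb: (B, D) query embeddings; text_emb, vis_emb: (B, D) per-modality
    video features; fuse: callable merging the modalities into one video embedding."""
    q = F.normalize(q_emb, dim=-1)
    video = F.normalize(fuse(text_emb, vis_emb), dim=-1)

    # Modality-shuffled videos: each video keeps its own text but is paired with
    # the visual features of a randomly permuted video in the batch, so it can
    # only match the query through text.
    perm = torch.randperm(vis_emb.size(0), device=vis_emb.device)
    ms_video = F.normalize(fuse(text_emb, vis_emb[perm]), dim=-1)

    sim = q @ video.t() / temperature                             # (B, B); diagonal = positives
    sim_ms = (q * ms_video).sum(-1, keepdim=True) / temperature   # (B, 1) MS negatives
    logits = torch.cat([sim, sim_ms], dim=1)
    target = torch.arange(q.size(0), device=q.device)             # each query's positive is its own video
    return F.cross_entropy(logits, target)


# Example usage with a trivial fusion (mean of the two modality embeddings):
# loss = ms_contrastive_loss(q, t, v, fuse=lambda t_, v_: (t_ + v_) / 2)
```

Pushing the MS sample below the genuine pair forces the fused video embedding to draw on the visual signal rather than text alone, which is the stated goal of balanced attention across modalities.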

Detailed Description

Saved in:
Bibliographic Details
Main Authors: Wang, Xun; Ke, Bingqing; Li, Xuanping; Liu, Fangyu; Zhang, Mingyu; Liang, Xiao; Xiao, Qiushi; Luo, Cheng; Yu, Yue
Format: Article
Language: eng
Subjects:
Online Access: Order full text
creator Wang, Xun
Ke, Bingqing
Li, Xuanping
Liu, Fangyu
Zhang, Mingyu
Liang, Xiao
Xiao, Qiushi
Luo, Cheng
Yu, Yue
description SIGIR, 2022. Video search has become the main way for users to discover videos relevant to a text query on large short-video sharing platforms. While training a query-video bi-encoder model on online search logs, we identify a modality bias phenomenon: the video encoder relies almost entirely on text matching and neglects the videos' other modalities, such as vision and audio. This modality imbalance results from a) the modality gap: the relevance between a query and a video's text is much easier to learn, because the query is itself text and shares the video text's modality; and b) data bias: most training samples can be solved by text matching alone. Here we share our practices for improving the first retrieval stage, including our solution to the modality imbalance issue. We propose MBVR (short for Modality-Balanced Video Retrieval) with two key components: manually generated modality-shuffled (MS) samples and a dynamic margin (DM) based on visual relevance, which together encourage the video encoder to pay balanced attention to each modality. Through extensive experiments on a real-world dataset, we show empirically that our method is both effective and efficient at solving the modality bias problem. We have also deployed MBVR on a large video platform and observed a statistically significant improvement over a highly optimized baseline in an A/B test and manual GSB evaluations.
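The description also mentions a dynamic margin (DM) driven by visual relevance. One plausible way to realize this, sketched below as an assumption rather than the paper's exact formulation, is a hinge term whose margin widens as the query becomes more visually relevant to the video, so that visually informative positives must be scored well above their modality-shuffled counterparts. The `visual_rel` input is a hypothetical per-pair score, e.g. from an auxiliary query-to-visual relevance model.

```python
# Hypothetical dynamic-margin hinge term: the required gap between a genuine
# video and its modality-shuffled counterpart grows with visual relevance.
import torch.nn.functional as F


def dynamic_margin_loss(q_emb, video_emb, ms_video_emb, visual_rel,
                        base_margin=0.1, scale=0.2):
    """visual_rel: (B,) scores in [0, 1] estimating how relevant the query is to
    the video's visual content (assumed to come from an auxiliary model)."""
    q = F.normalize(q_emb, dim=-1)
    v = F.normalize(video_emb, dim=-1)
    v_ms = F.normalize(ms_video_emb, dim=-1)
    pos = (q * v).sum(-1)       # similarity to the genuine multi-modal video
    neg = (q * v_ms).sum(-1)    # similarity to its modality-shuffled counterpart
    margin = base_margin + scale * visual_rel   # wider margin when vision matters more
    return F.relu(neg - pos + margin).mean()
```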
doi_str_mv 10.48550/arxiv.2204.08182
format Article
fulltext fulltext_linktorsrc
identifier DOI: 10.48550/arxiv.2204.08182
language eng
recordid cdi_arxiv_primary_2204_08182
source arXiv.org
subjects Computer Science - Artificial Intelligence
Computer Science - Computer Vision and Pattern Recognition
Computer Science - Information Retrieval
Statistics - Machine Learning
title Modality-Balanced Embedding for Video Retrieval
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-18T01%3A56%3A23IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-arxiv_GOX&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Modality-Balanced%20Embedding%20for%20Video%20Retrieval&rft.au=Wang,%20Xun&rft.date=2022-04-18&rft_id=info:doi/10.48550/arxiv.2204.08182&rft_dat=%3Carxiv_GOX%3E2204_08182%3C/arxiv_GOX%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_id=info:pmid/&rfr_iscdi=true