Collaborative Debias Strategy for Temporal Sentence Grounding in Video

Temporal sentence grounding in video has witnessed significant advancements, but suffers from substantial dataset bias, which undermines its generalization ability. Existing debias approaches primarily concentrate on well-known distribution and linguistic biases, while overlooking the relationship among different biases, limiting their debias capability. In this work, we delve into the existence of visual bias and combinatorial bias in the widely used datasets, and introduce a collaborative debias structure that can be seamlessly integrated into present methods. It encompasses four low-capacity models, a re-label module, and a main model. Each biased model deliberately leverages bias as shortcut information to accurately perform grounding, achieved by customizing the appropriate model structure and input data format to align with the bias characteristics. During the training phase, the gradient descent direction for optimizing the main model should align with the negative gradient descent direction of the biased model that is optimized by utilizing ground truth labels. Subsequently, the re-label module introduces a gradient aggregation function, consolidating the gradient descent direction from these biased models and constructing new labels to compel the main model to effectively capture multi-modality alignment features instead of relying on shortcut contents for grounding. Finally, we design two debias structures, P-Debias and C-Debias, to exploit the independence and inclusion relationships between different types of biases. Extensive experiments on multiple span-based models over Charades-CD and ActivityNet-CD demonstrate the exceptional debias capability of our strategy (https://github.com/qzhb/CDS).
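
The abstract describes a training loop in which low-capacity biased models deliberately absorb shortcut cues and a re-label module then rewrites the main model's supervision targets. The PyTorch sketch below is a rough illustration of that idea under simplifying assumptions, not the paper's exact formulation: it collapses the span-based start/end supervision into a single temporal distribution, trains each biased model on the ground-truth labels, averages their predictions as a stand-in for the aggregated gradient direction, and subtracts that shortcut mass from the target before supervising the main model. All names here (relabel, train_step, alpha, main_model, biased_models) are hypothetical.

```python
import torch
import torch.nn.functional as F

def relabel(label, biased_probs, alpha=0.5):
    """Illustrative stand-in for the paper's gradient aggregation /
    re-label step: suppress the probability mass the biased models
    already capture, then renormalize to a valid target distribution."""
    bias_avg = torch.stack(biased_probs).mean(dim=0)   # (B, T)
    new_label = F.relu(label - alpha * bias_avg)       # remove shortcut mass
    return new_label / new_label.sum(dim=-1, keepdim=True).clamp_min(1e-8)

def train_step(main_model, biased_models, optim_main, optims_bias,
               video, query, label):
    # 1) Each biased model is optimized on the ground-truth labels, so it
    #    intentionally learns the shortcut (bias) information.
    biased_probs = []
    for model, optim in zip(biased_models, optims_bias):
        logits = model(video, query)                   # (B, T) span scores
        loss_b = F.kl_div(F.log_softmax(logits, dim=-1), label,
                          reduction="batchmean")
        optim.zero_grad(); loss_b.backward(); optim.step()
        biased_probs.append(F.softmax(logits, dim=-1).detach())

    # 2) The re-label module builds new targets that steer the main model
    #    away from what the biased models can already predict.
    new_label = relabel(label, biased_probs)

    # 3) The main model is supervised with the debiased targets, forcing it
    #    to rely on multi-modality alignment rather than shortcuts.
    logits_m = main_model(video, query)
    loss_m = F.kl_div(F.log_softmax(logits_m, dim=-1), new_label,
                      reduction="batchmean")
    optim_main.zero_grad(); loss_m.backward(); optim_main.step()
    return loss_m.item()
```

In the paper itself the aggregation operates on gradient descent directions, and the P-Debias and C-Debias variants further exploit the independence and inclusion relationships among the four biases; this sketch only conveys the core supervise-away-from-shortcuts mechanism.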

Bibliographic Details
Published in: IEEE Transactions on Circuits and Systems for Video Technology, 2024-11, Vol. 34 (11), pp. 10972-10986
Authors: Qi, Zhaobo; Yuan, Yibo; Ruan, Xiaowen; Wang, Shuhui; Zhang, Weigang; Huang, Qingming
Format: Article
Language: English
Subjects: Bias; Collaboration; collaborative debias; Combinatorial analysis; combinatorial bias; Data models; Datasets; Grounding; Labels; Modules; Predictive models; Proposals; Sentences; Task analysis; Temporal sentence grounding in video; Training; visual bias; Visualization
DOI: 10.1109/TCSVT.2024.3413074
ISSN: 1051-8215
EISSN: 1558-2205
Source: IEEE Electronic Library (IEL)