Collaborative Debias Strategy for Temporal Sentence Grounding in Video

Temporal sentence grounding in video has witnessed significant advancements, but suffers from substantial dataset bias, which undermines its generalization ability. Existing debias approaches primarily concentrate on well-known distribution and linguistic biases, while overlooking the relationship among different biases, limiting their debias capability. In this work, we delve into the existence of visual bias and combinatorial bias in the widely used datasets, and introduce a collaborative debias structure that can be seamlessly integrated into present methods. It encompasses four low-capacity models, a re-label module, and a main model. Each biased model deliberately leverages bias as shortcut information to accurately perform grounding, achieved by customizing the appropriate model structure and input data format to align with the bias characteristics. During the training phase, the gradient descent direction for optimizing the main model should align with the negative gradient descent direction of the biased model that is optimized by utilizing ground truth labels. Subsequently, the re-label module introduces a gradient aggregation function, consolidating the gradient descent direction from these biased models and constructing new labels to compel the main model to effectively capture multi-modality alignment features instead of relying on shortcut contents for grounding. Finally, we design two debias structures, P-Debias and C-Debias, to exploit the independence and inclusion relationships between different types of biases. Extensive experiments on multiple span-based models over Charades-CD and ActivityNet-CD demonstrate the exceptional debias capability of our strategy (https://github.com/qzhb/CDS).
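
The abstract describes a training loop in which low-capacity biased models deliberately absorb shortcut cues and a re-label module then rewrites the main model's supervision targets. The PyTorch sketch below is a rough illustration of that idea under simplifying assumptions, not the paper's exact formulation: it collapses the span-based start/end supervision into a single temporal distribution, trains each biased model on the ground-truth labels, averages their predictions as a stand-in for the aggregated gradient direction, and subtracts that shortcut mass from the target before supervising the main model. All names here (relabel, train_step, alpha, main_model, biased_models) are hypothetical.

```python
import torch
import torch.nn.functional as F

def relabel(label, biased_probs, alpha=0.5):
    """Illustrative stand-in for the paper's gradient aggregation /
    re-label step: suppress the probability mass the biased models
    already capture, then renormalize to a valid target distribution."""
    bias_avg = torch.stack(biased_probs).mean(dim=0)   # (B, T)
    new_label = F.relu(label - alpha * bias_avg)       # remove shortcut mass
    return new_label / new_label.sum(dim=-1, keepdim=True).clamp_min(1e-8)

def train_step(main_model, biased_models, optim_main, optims_bias,
               video, query, label):
    # 1) Each biased model is optimized on the ground-truth labels, so it
    #    intentionally learns the shortcut (bias) information.
    biased_probs = []
    for model, optim in zip(biased_models, optims_bias):
        logits = model(video, query)                   # (B, T) span scores
        loss_b = F.kl_div(F.log_softmax(logits, dim=-1), label,
                          reduction="batchmean")
        optim.zero_grad(); loss_b.backward(); optim.step()
        biased_probs.append(F.softmax(logits, dim=-1).detach())

    # 2) The re-label module builds new targets that steer the main model
    #    away from what the biased models can already predict.
    new_label = relabel(label, biased_probs)

    # 3) The main model is supervised with the debiased targets, forcing it
    #    to rely on multi-modality alignment rather than shortcuts.
    logits_m = main_model(video, query)
    loss_m = F.kl_div(F.log_softmax(logits_m, dim=-1), new_label,
                      reduction="batchmean")
    optim_main.zero_grad(); loss_m.backward(); optim_main.step()
    return loss_m.item()
```

In the paper itself the aggregation operates on gradient descent directions, and the P-Debias and C-Debias variants further exploit the independence and inclusion relationships among the four biases; this sketch only conveys the core supervise-away-from-shortcuts mechanism.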

Bibliographic Details
Published in: IEEE Transactions on Circuits and Systems for Video Technology, 2024-11, Vol. 34 (11), pp. 10972-10986
Authors: Qi, Zhaobo; Yuan, Yibo; Ruan, Xiaowen; Wang, Shuhui; Zhang, Weigang; Huang, Qingming
Format: Article
Language: English
Subjects: Bias; Collaboration; collaborative debias; Combinatorial analysis; combinatorial bias; Data models; Datasets; Grounding; Labels; Modules; Predictive models; Proposals; Sentences; Task analysis; Temporal sentence grounding in video; Training; visual bias; Visualization
DOI: 10.1109/TCSVT.2024.3413074
ISSN: 1051-8215
EISSN: 1558-2205
Source: IEEE Electronic Library (IEL)