Collaborative Debias Strategy for Temporal Sentence Grounding in Video
Published in: IEEE Transactions on Circuits and Systems for Video Technology, 2024-11, Vol. 34 (11), pp. 10972-10986
Authors: Qi, Zhaobo; Yuan, Yibo; Ruan, Xiaowen; Wang, Shuhui; Zhang, Weigang; Huang, Qingming
Format: Article
Language: English
DOI: 10.1109/TCSVT.2024.3413074
ISSN: 1051-8215
EISSN: 1558-2205
Source: IEEE Electronic Library (IEL)
Subjects: Bias; Collaboration; collaborative debias; Combinatorial analysis; combinatorial bias; Data models; Datasets; Grounding; Labels; Modules; Predictive models; Proposals; Sentences; Task analysis; Temporal sentence grounding in video; Training; visual bias; Visualization
Abstract: Temporal sentence grounding in video has witnessed significant advancements, but suffers from substantial dataset bias, which undermines its generalization ability. Existing debias approaches primarily concentrate on well-known distribution and linguistic biases, while overlooking the relationships among different biases, limiting their debias capability. In this work, we delve into the existence of visual bias and combinatorial bias in the widely used datasets, and introduce a collaborative debias structure that can be seamlessly integrated into existing methods. It encompasses four low-capacity models, a re-label module, and a main model. Each biased model deliberately leverages bias as shortcut information to accurately perform grounding, achieved by customizing the appropriate model structure and input data format to align with the bias characteristics. During the training phase, the gradient descent direction for optimizing the main model should align with the negative gradient descent direction of the biased model that is optimized using ground-truth labels. Subsequently, the re-label module introduces a gradient aggregation function, consolidating the gradient descent directions from these biased models and constructing new labels that compel the main model to capture multi-modality alignment features instead of relying on shortcut content for grounding. Finally, we design two debias structures, P-Debias and C-Debias, to exploit the independence and inclusion relationships between different types of biases. Extensive experiments on multiple span-based models over Charades-CD and ActivityNet-CD demonstrate the exceptional debias capability of our strategy (https://github.com/qzhb/CDS).
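To make the abstract's mechanism concrete, below is a minimal PyTorch-style sketch of the gradient-aggregation re-labeling idea. It is an illustration under stated assumptions, not the authors' implementation (the official code is at https://github.com/qzhb/CDS): the helpers make_biased_model and relabel, the mean as the aggregation function, the step size, and the softmax label construction are all hypothetical, and the real method operates on span-based grounding predictions rather than generic position logits.

# Minimal sketch of the re-label idea described in the abstract. Assumptions
# (not from the paper): generic span-position logits, a mean as the gradient
# aggregation function, and a softmax step to build the new soft labels.
import torch
import torch.nn.functional as F


def make_biased_model(in_dim: int, num_positions: int) -> torch.nn.Module:
    # Deliberately low-capacity, so it can only succeed via shortcut features.
    return torch.nn.Linear(in_dim, num_positions)


def relabel(biased_models, biased_inputs, labels, num_positions, step=1.0):
    """Hypothetical re-label module: aggregate the gradient directions of the
    biased models w.r.t. a shared label distribution, then move the training
    target *against* them, so the main model cannot win by copying shortcuts."""
    soft = F.one_hot(labels, num_positions).float()
    target = soft.clone().requires_grad_(True)
    grads = []
    for model, x in zip(biased_models, biased_inputs):
        log_probs = F.log_softmax(model(x), dim=-1)
        # Cross entropy between the current target and the biased prediction.
        loss = -(target * log_probs).sum(dim=-1).mean()
        grads.append(torch.autograd.grad(loss, target)[0])
    agg = torch.stack(grads).mean(dim=0)  # gradient aggregation (assumed: mean)
    # Ascend the biased losses: the new labels shift mass toward positions
    # the biased models find hard to predict from shortcut information.
    return F.softmax(soft + step * agg, dim=-1).detach()


# Usage sketch: the main model then trains on the re-labeled soft targets.
torch.manual_seed(0)
B, D, P = 8, 32, 10                                    # batch, feature dim, positions
biased = [make_biased_model(D, P) for _ in range(4)]   # four biased models
main = torch.nn.Sequential(torch.nn.Linear(D, 64), torch.nn.ReLU(),
                           torch.nn.Linear(64, P))
x = torch.randn(B, D)
y = torch.randint(0, P, (B,))
new_y = relabel(biased, [x] * 4, y, P)                 # shared input only for brevity
loss = F.cross_entropy(main(x), new_y)                 # soft-target cross entropy
loss.backward()

In the paper's full design, each of the four biased models receives its own input format matched to the bias it should capture (e.g., visual-only or query-only evidence); the shared input above is purely a simplification for the sketch.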