Collaborative Debias Strategy for Temporal Sentence Grounding in Video

Bibliographic Details
Published in: IEEE Transactions on Circuits and Systems for Video Technology, 2024-11, Vol. 34 (11), pp. 10972-10986
Authors: Qi, Zhaobo; Yuan, Yibo; Ruan, Xiaowen; Wang, Shuhui; Zhang, Weigang; Huang, Qingming
Format: Article
Language: English
Description
Abstract: Temporal sentence grounding in video has witnessed significant advancements, but it suffers from substantial dataset bias, which undermines its generalization ability. Existing debias approaches primarily concentrate on well-known distribution and linguistic biases while overlooking the relationships among different biases, which limits their debias capability. In this work, we delve into the existence of visual bias and combinatorial bias in the widely used datasets and introduce a collaborative debias structure that can be seamlessly integrated into existing methods. It encompasses four low-capacity models, a re-label module, and a main model. Each biased model deliberately leverages a specific bias as shortcut information to perform grounding accurately, which is achieved by customizing its model structure and input data format to match the characteristics of that bias. During training, the gradient descent direction for optimizing the main model should align with the negative of the gradient descent direction of each biased model, which is optimized using ground-truth labels. The re-label module then introduces a gradient aggregation function that consolidates the gradient descent directions of these biased models and constructs new labels, compelling the main model to capture multi-modality alignment features instead of relying on shortcut content for grounding. Finally, we design two debias structures, P-Debias and C-Debias, to exploit the independence and inclusion relationships between different types of biases. Extensive experiments with multiple span-based models on Charades-CD and ActivityNet-CD demonstrate the exceptional debias capability of our strategy (code: https://github.com/qzhb/CDS).
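The abstract outlines the core training mechanism: low-capacity biased models are trained directly on ground-truth labels so that they exploit shortcut cues, and a re-label module aggregates their outputs to construct new targets for the main model. The sketch below illustrates one such training step in PyTorch under simplifying assumptions: grounding is treated as classification over temporal positions, and the class and function names (BiasedModel, MainModel, aggregate_and_relabel, debias_step) as well as the label-construction rule are hypothetical placeholders, not the authors' gradient aggregation function; the released implementation is at https://github.com/qzhb/CDS.

```python
# Minimal sketch of a collaborative debias training step. All names and the
# aggregation rule are illustrative placeholders, not the paper's implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class BiasedModel(nn.Module):
    """Low-capacity model fed only a bias-correlated input (e.g., the query alone)."""

    def __init__(self, in_dim, num_positions):
        super().__init__()
        self.head = nn.Linear(in_dim, num_positions)  # deliberately shallow

    def forward(self, bias_feat):
        return self.head(bias_feat)  # logits over temporal positions


class MainModel(nn.Module):
    """Stand-in for a span-based grounding backbone (placeholder, not the paper's)."""

    def __init__(self, vdim, qdim, num_positions):
        super().__init__()
        self.fuse = nn.Linear(vdim + qdim, num_positions)

    def forward(self, video_feat, query_feat):
        return self.fuse(torch.cat([video_feat, query_feat], dim=-1))


def aggregate_and_relabel(gt_onehot, biased_logits_list):
    """Re-label module (placeholder rule): average the biased models' predictions
    and push the new target away from what the shortcuts already explain."""
    biased_prob = torch.stack(
        [F.softmax(logits, dim=-1) for logits in biased_logits_list]
    ).mean(dim=0)
    new_label = torch.clamp(gt_onehot - biased_prob, min=0.0)
    return new_label / new_label.sum(dim=-1, keepdim=True).clamp(min=1e-6)


def debias_step(main_model, biased_models, optim_main, optims_bias,
                video_feat, query_feat, bias_feats, gt_onehot):
    # 1) Train each biased model on ground-truth labels so that it fully
    #    exploits its bias as a shortcut.
    biased_logits = []
    for model, optim, feat in zip(biased_models, optims_bias, bias_feats):
        logits = model(feat)
        loss_b = F.cross_entropy(logits, gt_onehot.argmax(dim=-1))
        optim.zero_grad()
        loss_b.backward()
        optim.step()
        biased_logits.append(logits.detach())

    # 2) The re-label module builds new soft targets from the aggregated
    #    biased predictions.
    new_label = aggregate_and_relabel(gt_onehot, biased_logits)

    # 3) Optimize the main model against the re-labelled targets, steering it
    #    toward multi-modal alignment cues rather than shortcut content.
    main_logits = main_model(video_feat, query_feat)
    loss_main = F.cross_entropy(main_logits, new_label)  # soft (probability) targets
    optim_main.zero_grad()
    loss_main.backward()
    optim_main.step()
    return loss_main.item()
```

In this sketch the biased models' predictions are simply averaged and subtracted from the one-hot targets; the paper instead consolidates the gradient descent directions of the biased models, and its P-Debias and C-Debias variants further arrange the biased models according to whether their biases are independent of or included in one another.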
ISSN: 1051-8215, 1558-2205
DOI: 10.1109/TCSVT.2024.3413074