Bilateral Temporal Re-Aggregation for Weakly-Supervised Video Object Segmentation

Detailed description

Bibliographic details
Published in: IEEE Transactions on Circuits and Systems for Video Technology, 2022-07, Vol. 32 (7), p. 4498-4512
Main authors: Lin, Fanchao; Xie, Hongtao; Liu, Chuanbin; Zhang, Yongdong
Format: Article
Language: English
Description
Summary: Weakly-supervised video object segmentation is an emerging video task in which the target is tracked and segmented given only a simple bounding-box label, which requires the method to fully capture and exploit the target information. Most existing approaches rely on the guidance of a single frame and ignore the interaction between frames when gathering information, which makes it hard for them to obtain a reliable target representation. In this paper, we propose to capture temporal dependencies and gather information from multiple frames through bilateral temporal re-aggregation. We explore three schemes to build the aggregation: 1) a two-stage re-aggregation mechanism provides a target prior to the current frame, which yields more valid feature matching and information aggregation; 2) a query-memory bilateral aggregation module aggregates features from an unlimited number of past frames and enables mutual perception between frames to validate the gathered information; 3) a novel cross-task representation distillation guides the learning of the aggregation modules, transferring knowledge from a semi-supervised model to our weakly-supervised model without increasing inference latency. Together, these schemes build an efficient and competent aggregation process that fully exploits the video context during inference. Experimental results on four benchmarks show that our method outperforms previous methods while maintaining efficiency (e.g., overall scores of 70.4% and 72.5% on the YouTube-VOS and DAVIS 2017 validation sets, respectively).
ISSN: 1051-8215, 1558-2205
DOI: 10.1109/TCSVT.2021.3127562
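
Note on the summary: the idea of aggregating features from past frames into the current frame can be illustrated with a generic attention-style memory read-out. The sketch below is not the paper's module. It shows only one direction (the current frame reading from memory), it omits the bilateral validation step and the two-stage re-aggregation, and the names `read_memory`, `query_keys`, `memory_keys` and `memory_values` are hypothetical, chosen here for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def read_memory(query_keys, memory_keys, memory_values):
    """Return, for each query location, an attention-weighted sum of memory values.

    query_keys:    key vectors from the current frame (assumed layout)
    memory_keys:   key vectors pooled from past frames (assumed layout)
    memory_values: value vectors aligned one-to-one with memory_keys
    """
    scale = 1.0 / math.sqrt(len(query_keys[0]))  # usual scaled dot-product factor
    dim = len(memory_values[0])
    readout = []
    for q in query_keys:
        # Similarity of this query location to every memory location.
        weights = softmax([dot(q, k) * scale for k in memory_keys])
        # Weighted sum of memory values, one output channel at a time.
        readout.append([
            sum(w * v[d] for w, v in zip(weights, memory_values))
            for d in range(dim)
        ])
    return readout


# Toy data: two memory locations, one query location that resembles the first.
memory_keys = [[1.0, 0.0], [0.0, 1.0]]
memory_values = [[10.0], [20.0]]
query_keys = [[4.0, 0.0]]
result = read_memory(query_keys, memory_keys, memory_values)
```

In the toy data the query is much closer to the first memory key, so the read-out is pulled toward that location's value. Because the memory is just a list, frames can be appended to it without changing the function, which is one simple way to think about reading from a growing set of past frames.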