Weakly supervised multi-class semantic video segmentation for road scenes


Bibliographic Details
Published in: Computer Vision and Image Understanding, 2023-04, Vol. 230, p. 103664, Article 103664
Authors: Awan, Mehwish; Shin, Jitae
Format: Article
Language: English
Online access: Full text
Description
Abstract: Weakly supervised multi-class video segmentation is one of the most challenging yet least studied research problems in computer vision. This study investigates two main items: (1) effective feature updates for temporal changes, combined with feature reuse between temporal frames; and (2) learning object patterns in complex scenes, specifically for videos under weak supervision. Associating image tags with visual appearance is not a straightforward learning task, especially for complex scenes. Therefore, in this paper, we present manifold augmentations to obtain reliable pixel labels from image tags. We propose a framework comprising two key modules: a temporal split module for efficient video processing and a pseudo per-pixel seed generation module for precise pixel-level supervision. In particular, our model exploits temporal correlations via the temporal split module and temporal attention. To reuse extracted features and incorporate temporal updates for precise and fast computation, a channel-wise temporal split mechanism between successive video frames is presented. Furthermore, we evaluated the proposed modules in two additional settings: (1) fully or sparsely supervised road-scene video segmentation; and (2) weakly supervised segmentation of complex road-scene images. Experiments are conducted on the Cityscapes and CamVid datasets, using DeepLabv3 as the segmentation network and LiteFlowNet to compute motion vectors.

Highlights:
• A temporal split module is proposed for precise feature reuse among temporal frames.
• A co-attention mechanism between video frames is proposed for discriminative features.
• A two-fold refinement method is presented for pixel-level pseudo-label generation.
• State-of-the-art performance is shown on the Cityscapes and CamVid video benchmarks.
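The channel-wise temporal split idea from the abstract — reusing part of the previous frame's feature channels while recomputing only the rest for the current frame — might be sketched as follows. This is a minimal illustration under stated assumptions: the function name `temporal_split_update`, the fixed 50% reuse ratio, and the (C, H, W) feature layout are illustrative choices, not the authors' published implementation.

```python
import numpy as np

def temporal_split_update(prev_feat: np.ndarray,
                          curr_feat: np.ndarray,
                          reuse_ratio: float = 0.5) -> np.ndarray:
    """Hypothetical channel-wise temporal split between successive frames.

    A fraction of channels is carried over (reused) from the previous
    frame's features, and only the remaining channels come from the
    freshly computed features of the current frame. Both inputs are
    assumed to have shape (C, H, W).
    """
    c = prev_feat.shape[0]
    k = int(c * reuse_ratio)  # number of channels reused from the previous frame
    # Concatenate reused channels with newly computed ones along the channel axis.
    return np.concatenate([prev_feat[:k], curr_feat[k:]], axis=0)

# Toy usage: zeros stand in for the previous frame's features,
# ones for the current frame's features.
prev = np.zeros((4, 2, 2))
curr = np.ones((4, 2, 2))
fused = temporal_split_update(prev, curr)  # first 2 channels reused, last 2 updated
```

In a real pipeline the reused channels would let the network skip part of the per-frame computation, trading a small accuracy cost for speed on temporally redundant video.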
ISSN: 1077-3142, 1090-235X
DOI: 10.1016/j.cviu.2023.103664