Weakly supervised multi-class semantic video segmentation for road scenes
| Published in: | Computer Vision and Image Understanding, 2023-04, Vol. 230, Article 103664 |
|---|---|
| Main authors: | , |
| Format: | Article |
| Language: | English |
| Online access: | Full text |
| ISSN: | 1077-3142, 1090-235X |
| DOI: | 10.1016/j.cviu.2023.103664 |
Summary:

Weakly supervised multi-class video segmentation is one of the most challenging yet least studied problems in computer vision. This study investigates two main items: (1) effective feature updates for temporal changes, combined with feature reuse between temporal frames; and (2) learning object patterns in complex scenes, specifically for videos under weak supervision. Associating image tags with visual appearance is not a straightforward learning task, especially for complex scenes. Therefore, in this paper, we present manifold augmentations to obtain reliable pixel labels from image tags. We propose a framework comprising two key modules: a temporal split module for efficient video processing and a pseudo per-pixel seed generation module for precise pixel-level supervision. In particular, our model exploits temporal correlations via the temporal split module and temporal attention. To reuse extracted features and incorporate temporal updates for precise and fast computation, a channel-wise temporal split mechanism between successive video frames is presented. Furthermore, we evaluate the proposed modules in two additional settings: (1) fully or sparsely supervised road-scene video segmentation; and (2) weakly supervised segmentation of complex road-scene images. Experiments are conducted on the Cityscapes and CamVid datasets, using DeepLabv3 as the segmentation network and LiteFlowNet to compute motion vectors.
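The channel-wise temporal split described above can be pictured as recomputing only a fraction of feature channels for each new frame while carrying the rest over from the previous frame. Below is a minimal PyTorch sketch of that idea; the class name `TemporalSplit`, the `reuse_ratio` parameter, and the caching scheme are illustrative assumptions, not the authors' implementation (which additionally uses motion vectors computed by LiteFlowNet).

```python
# Illustrative sketch of channel-wise temporal feature reuse between
# successive video frames. Names and design are assumptions, not the
# paper's actual module.
import torch
import torch.nn as nn

class TemporalSplit(nn.Module):
    """Recompute a subset of feature channels for the current frame and
    reuse the remaining channels cached from the previous frame."""

    def __init__(self, channels: int, reuse_ratio: float = 0.5):
        super().__init__()
        self.n_reuse = int(channels * reuse_ratio)      # channels carried over
        # cheap update path: produces only the "fresh" channels for frame t
        self.update = nn.Conv2d(channels, channels - self.n_reuse,
                                kernel_size=3, padding=1)
        self.cache = None                               # features from frame t-1

    def reset(self):
        """Call at the start of each new video clip."""
        self.cache = None

    def forward(self, feat_t: torch.Tensor) -> torch.Tensor:
        fresh = self.update(feat_t)                     # updated channels, frame t
        if self.cache is None:                          # first frame: no history yet
            self.cache = feat_t.detach()
        reused = self.cache[:, :self.n_reuse]           # stale but free channels
        out = torch.cat([fresh, reused], dim=1)         # same width as the input
        self.cache = out.detach()                       # no backprop across frames
        return out
```

In this toy version, half of the channels per frame cost one 3x3 convolution and the other half cost nothing, which illustrates the source of the speedup the abstract claims; the trade-off is that the reused channels lag one frame behind.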
Highlights:

- A temporal split module is proposed for precise feature reuse among temporal frames.
- A co-attention mechanism between video frames is proposed for discriminative features (a sketch follows this list).
- A two-fold refinement method is presented for pixel-level pseudo-label generation.
- State-of-the-art performance is shown on the Cityscapes and CamVid video benchmarks.
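The co-attention highlight can be read as cross-frame attention: each spatial position in one frame attends over all positions of a neighboring frame, so features supported by both frames are strengthened. The following is a minimal sketch of one plausible formulation (scaled dot-product affinity over flattened spatial positions); the function name and the residual fusion are assumptions, not the paper's exact design.

```python
# Illustrative temporal co-attention between two frames' feature maps.
# An assumed scaled-dot-product formulation, not the paper's exact design.
import torch

def temporal_co_attention(feat_a: torch.Tensor, feat_b: torch.Tensor) -> torch.Tensor:
    """feat_a, feat_b: (B, C, H, W) features from two nearby frames.
    Returns feat_a enriched with information attended from feat_b."""
    B, C, H, W = feat_a.shape
    q = feat_a.flatten(2).transpose(1, 2)           # (B, HW, C) queries, frame a
    k = feat_b.flatten(2)                           # (B, C, HW) keys, frame b
    v = feat_b.flatten(2).transpose(1, 2)           # (B, HW, C) values, frame b
    attn = torch.softmax(q @ k / C ** 0.5, dim=-1)  # (B, HW, HW) cross-frame affinity
    out = (attn @ v).transpose(1, 2).reshape(B, C, H, W)
    return feat_a + out                             # residual fusion with frame a

# Note: the HW x HW affinity matrix is quadratic in spatial size, so real
# implementations typically attend over downsampled feature maps.
```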