Semantic Segmentation in Thermal Videos: A New Benchmark and Multi-Granularity Contrastive Learning-Based Framework

Bibliographic Details
Published in: IEEE Transactions on Intelligent Transportation Systems, 2023-12, Vol. 24 (12), p. 14783-14799
Authors: Zheng, Yu; Zhou, Fugen; Liang, Shangying; Song, Wentao; Bai, Xiangzhi
Format: Article
Language: English
Description
Abstract: Video semantic segmentation has achieved great success and is significant for road scene understanding. However, semantic segmentation remains challenging under poor illumination and inclement weather. Thermal cameras, which are highly invariant to lighting and penetrate rain and fog well, enable semantic segmentation to work under such challenging conditions. This paper therefore explores semantic segmentation in thermal videos to broaden the scope of road scene understanding applications. We offer TVSS, the first thermal video semantic segmentation dataset, comprising 1695 thermal videos with 50850 frames of road scenes. It is available at: https://xzbai.buaa.edu.cn/datasets.html. TVSS is finely annotated with 17 categories at a frame rate of 1 fps, with a labeled pixel density of 98.9%. Existing video semantic segmentation methods rely on the quantity of labels and the representation power of backbones, and therefore cannot achieve ideal results on thermal videos. We thus introduce a multi-granularity contrastive learning-based thermal video semantic segmentation model (MGCL), which exploits the abundant unlabeled frames to boost supervised segmentation. Specifically, MGCL constructs multi-granularity self-supervised signals on unlabeled thermal videos via contrastive learning, including an intra-frame context generalization loss, an intra-clip temporal consistency loss, and an inter-video category discrimination loss. In addition, a hard anchor sampling strategy is introduced to focus on hard-to-classify pixels for further performance improvement. Extensive experiments on TVSS demonstrate the superior performance of MGCL in both accuracy and efficiency. Compared to 12 state-of-the-art semantic segmentation methods, MGCL achieves 2.8% to 8.1% gains in mIoU while maintaining inference speed.
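
The abstract names MGCL's self-supervised ingredients (pixel-level contrastive losses over unlabeled frames and a hard anchor sampling strategy) without giving their formulation. The sketch below is not the authors' implementation; it is a minimal PyTorch illustration, under assumed tensor shapes and hyperparameters, of an InfoNCE-style pixel contrastive loss and a hard-anchor selector of the kind such losses are typically built on. The function names (info_nce, sample_hard_anchors) and defaults are illustrative assumptions.

import torch
import torch.nn.functional as F

def info_nce(anchor, positive, negatives, temperature=0.1):
    # anchor: (N, C); positive: (N, C); negatives: (N, K, C)
    # Generic InfoNCE contrastive loss: pull anchor toward its positive,
    # push it away from the K negatives.
    anchor = F.normalize(anchor, dim=-1)
    positive = F.normalize(positive, dim=-1)
    negatives = F.normalize(negatives, dim=-1)
    pos_logit = (anchor * positive).sum(dim=-1, keepdim=True) / temperature   # (N, 1)
    neg_logits = torch.einsum('nc,nkc->nk', anchor, negatives) / temperature  # (N, K)
    logits = torch.cat([pos_logit, neg_logits], dim=1)                        # (N, 1+K)
    # The positive always sits at index 0 of the logits.
    labels = torch.zeros(anchor.size(0), dtype=torch.long, device=anchor.device)
    return F.cross_entropy(logits, labels)

def sample_hard_anchors(embeddings, seg_logits, labels, num_anchors=256):
    # embeddings: (P, C) per-pixel features; seg_logits: (P, num_classes); labels: (P,)
    # "Hard" pixels are taken here to be those the current segmentation head
    # misclassifies; a random subset of them is returned as contrastive anchors.
    preds = seg_logits.argmax(dim=1)
    hard_idx = (preds != labels).nonzero(as_tuple=True)[0]
    if hard_idx.numel() == 0:
        # Fallback when every pixel is classified correctly: use all pixels.
        hard_idx = torch.arange(embeddings.size(0), device=embeddings.device)
    perm = torch.randperm(hard_idx.numel(), device=embeddings.device)
    chosen = hard_idx[perm[:num_anchors]]
    return embeddings[chosen], chosen

For instance, an intra-clip temporal consistency term could use a pixel embedding at frame t as the anchor, the embedding of the corresponding pixel at frame t+1 as the positive, and embeddings of unrelated pixels as negatives, with sample_hard_anchors restricting the loss to pixels the segmentation head currently misclassifies.
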
ISSN: 1524-9050, 1558-0016
DOI: 10.1109/TITS.2023.3300038