TDSNet: A temporal difference based network for video semantic segmentation

Bibliographic details
Published in: Information sciences 2025-01, Vol. 686, p. 121335, Article 121335
Main authors: Yuan, Haochen; Peng, Junjie; Cai, Zesu
Format: Article
Language: English
Description
Abstract: Video semantic segmentation (VSS) is a fundamental machine vision task with practical applications such as autonomous driving and automated surveillance. Current studies mainly exploit temporal features based on optical flow and the self-attention mechanism to improve VSS accuracy. However, these approaches still suffer from reduced accuracy and computational overhead, caused by inaccurate optical flow and the cost of the self-attention mechanism. To solve these problems, we propose a Temporal Difference Segmentation Net (TDSNet). To improve accuracy while keeping computational costs low, TDSNet extracts temporal-difference-based temporal features through the Temporal Feature Refine Module (TFRM). To further improve accuracy, TDSNet adaptively fuses temporal features of varied motion magnitude with the Motion Magnitude Refine Module (MMRM), which weighs and fuses temporal features of different magnitudes between frames. Extensive experimental results demonstrate that the comprehensive performance of TDSNet outperforms that of State-Of-The-Art (SOTA) VSS models on two large-scale public datasets: VSPW and Cityscapes. For instance, on VSPW, TDSNet runs 13.2 frames per second faster than the SOTA model CFFM++, while its mIoU is only 0.7% lower than that of CFFM++. These results indicate promising performance in VSS applications.

• The model TDSNet is constructed to accurately segment frames of the video.
• Temporal-difference-based temporal features are utilized to improve segmentation accuracy.
• The temporal features are adaptively aggregated according to motion information.
• Experimental results indicate that the proposed work accurately segments the targets in the video.
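The abstract rests on two ideas: temporal differencing between consecutive frame features (the input to the TFRM) and fusing those temporal features weighted by motion magnitude (the role of the MMRM). A minimal NumPy sketch of the general technique is below; the function names, feature shapes, and the softmax-over-magnitude weighting are illustrative assumptions, not the paper's actual architecture.

```python
import numpy as np

def temporal_difference(prev_feat, cur_feat):
    """Temporal difference between consecutive frame feature maps
    (a hypothetical stand-in for the features the TFRM refines)."""
    return cur_feat - prev_feat

def magnitude_weighted_fusion(cur_feat, temporal_feats):
    """Fuse temporal features, weighting each by its motion magnitude
    (mean absolute difference) -- a rough sketch of the idea behind
    the MMRM, not the paper's actual module design."""
    mags = np.array([np.mean(np.abs(t)) for t in temporal_feats])
    weights = np.exp(mags) / np.sum(np.exp(mags))  # softmax over magnitudes
    fused = sum(w * t for w, t in zip(weights, temporal_feats))
    return cur_feat + fused  # residual-style aggregation (assumed)

# Toy example: three 4x4 feature maps from consecutive frames.
rng = np.random.default_rng(0)
frames = [rng.standard_normal((4, 4)) for _ in range(3)]
diffs = [temporal_difference(frames[i], frames[i + 1]) for i in range(2)]
out = magnitude_weighted_fusion(frames[-1], diffs)
print(out.shape)  # (4, 4)
```

The softmax weighting makes features with larger inter-frame motion contribute more to the fused representation, which matches the abstract's description of adaptively aggregating temporal features according to motion information.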
ISSN:0020-0255
DOI:10.1016/j.ins.2024.121335