FIE-Net: spatiotemporal full-stage interaction enhancement network for video salient object detection
Published in: Signal, Image and Video Processing, 2024-09, Vol. 18 (8-9), p. 6321-6337
Format: Article
Language: English
Abstract: In video salient object detection, effectively fusing spatiotemporal cues is the key to detecting salient objects. Existing methods suffer from inadequate fusion or over-reliance on a single source of information, which makes them perform poorly in complex scenes. To address these issues, we propose a new spatiotemporal full-stage interaction enhancement network (FIE-Net) for video salient object detection. FIE-Net applies spatiotemporal information interaction throughout the encoder–decoder stages, fully exploiting the complementarity of the spatial and temporal modalities. Specifically, we introduce a progressive attention guidance unit in the encoder, which adaptively fuses spatiotemporal features under a progressive structure for efficient interaction of spatiotemporal information. In the decoder, we incorporate a cross-modal global refinement unit, which uses spatiotemporal global features to refine and complement the encoder features and obtain more complete salient information. In addition, we employ a multilevel information correction unit that further filters the input features using spatial low-level features and optical-flow prediction maps to obtain more accurate saliency information. We conducted experiments on four benchmark datasets. The results show that our method is highly competitive with current state-of-the-art algorithms.
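To make the idea of adaptively fusing spatial (RGB) and temporal (optical-flow) encoder features more concrete, the sketch below shows one plausible gated-fusion module in PyTorch. It is only an illustrative assumption: the class name SpatiotemporalFusion, the 1×1-convolution gating, and the tensor shapes are not taken from the paper, and the actual progressive attention guidance unit of FIE-Net is not specified in this record.

```python
# Hedged sketch of attention-gated spatiotemporal fusion. All names and design
# choices here are assumptions for illustration, not the paper's actual module.
import torch
import torch.nn as nn


class SpatiotemporalFusion(nn.Module):
    """Fuse RGB and optical-flow features with a learned per-pixel gate."""

    def __init__(self, channels: int):
        super().__init__()
        # A 1x1 conv predicts a gate from the concatenated modalities.
        self.gate = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=1),
            nn.Sigmoid(),
        )
        # A light refinement block smooths the fused features.
        self.refine = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, rgb_feat: torch.Tensor, flow_feat: torch.Tensor) -> torch.Tensor:
        # The gate decides, per location and channel, how much each modality contributes.
        g = self.gate(torch.cat([rgb_feat, flow_feat], dim=1))
        fused = g * rgb_feat + (1.0 - g) * flow_feat
        return self.refine(fused)


if __name__ == "__main__":
    # Example: fuse mid-level encoder features of shape (B, C, H, W).
    rgb = torch.randn(2, 64, 56, 56)
    flow = torch.randn(2, 64, 56, 56)
    fused = SpatiotemporalFusion(64)(rgb, flow)
    print(fused.shape)  # torch.Size([2, 64, 56, 56])
```

In this kind of design, the sigmoid gate lets the network lean on appearance features when motion is noisy and on motion features when the object is visually camouflaged; a full progressive structure, as described in the abstract, would apply such fusion repeatedly across encoder stages.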
ISSN: 1863-1703, 1863-1711
DOI: 10.1007/s11760-024-03319-6