LFTransNet: Light Field Salient Object Detection via a Learnable Weight Descriptor

Bibliographic Details
Published in: IEEE Transactions on Circuits and Systems for Video Technology, 2023-12, Vol. 33 (12), p. 1-1
Authors: Liu, Zhengyi, He, Qian, Wang, Linbo, Fang, Xianong, Tang, Bin
Format: Article
Language: English
Description
Abstract: Light Field Salient Object Detection (LF SOD) aims to segment visually distinctive objects from their surroundings. Since light field images provide a multi-focus stack (many focal slices at different depth levels) and an all-focus image of the same scene, they record comprehensive but redundant information. Existing methods exploit the useful cues via long short-term memory with attention mechanisms, 3D convolution, and graph learning. However, the relative importance of intra-slice and inter-slice information in the focal stack has not been well investigated. In this paper, we propose a learnable weight descriptor that simultaneously exploits different weights along the slice, spatial, and channel dimensions, and build an LF SOD method upon it. The method extracts slice features and all-focus features from a weight-shared backbone and a separate backbone, respectively. A transformer decoder learns the weight descriptor, which both emphasizes the importance of each slice (inter-slice) and discriminates the spatial and channel importance within each slice (intra-slice). The learnt descriptor serves as a weight that makes slice features attend to important slices, regions, and channels. Furthermore, we propose a hierarchical multi-modal fusion that aggregates high-layer features by modelling long-range dependencies to fully excavate common salient semantics, and combines low-layer features under spatial constraints to eliminate the blurring effect of slice features. Experimental results outperform state-of-the-art methods by at least 25% in terms of the mean absolute error metric, demonstrating a significant improvement in LF SOD performance from the designed learnable weight descriptor. Code: https://github.com/liuzywen/LFTransNet.
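To make the abstract's core idea concrete, the sketch below shows one plausible reading of a "learnable weight descriptor": a set of learnable queries refined by a transformer decoder over pooled focal-slice features, which then predicts inter-slice and per-channel weights used to re-weight the slice features. All shapes, layer sizes, pooling choices, and the weighting scheme here are illustrative assumptions, not the authors' implementation; the official code is at https://github.com/liuzywen/LFTransNet.

```python
# Minimal PyTorch sketch (assumed design, not the paper's actual code):
# learnable queries -> transformer decoder over slice tokens -> weights
# over slices (inter-slice) and channels (intra-slice) for re-weighting.
import torch
import torch.nn as nn


class WeightDescriptor(nn.Module):
    def __init__(self, num_slices=12, channels=256, num_layers=2):
        super().__init__()
        # One learnable query per focal slice (hypothetical choice).
        self.queries = nn.Parameter(torch.randn(num_slices, channels))
        layer = nn.TransformerDecoderLayer(d_model=channels, nhead=8,
                                           batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=num_layers)
        self.slice_head = nn.Linear(channels, 1)           # inter-slice weight
        self.channel_head = nn.Linear(channels, channels)  # intra-slice (channel) weight

    def forward(self, slice_feats):
        # slice_feats: (B, N, C, H, W) features of N focal slices.
        b, n, c, h, w = slice_feats.shape
        tokens = slice_feats.mean(dim=(3, 4))              # (B, N, C) pooled slice tokens
        queries = self.queries.unsqueeze(0).expand(b, -1, -1)
        desc = self.decoder(queries, tokens)               # (B, N, C) learned descriptor

        slice_w = self.slice_head(desc).softmax(dim=1)     # (B, N, 1) weights across slices
        chan_w = self.channel_head(desc).sigmoid()         # (B, N, C) weights within a slice

        weights = (slice_w * chan_w).reshape(b, n, c, 1, 1)
        return slice_feats * weights                       # re-weighted slice features


if __name__ == "__main__":
    feats = torch.randn(2, 12, 256, 32, 32)
    out = WeightDescriptor()(feats)
    print(out.shape)  # torch.Size([2, 12, 256, 32, 32])
```

A spatial weighting branch (per-pixel maps instead of, or in addition to, the channel weights) would follow the same pattern; it is omitted here to keep the sketch short.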
ISSN: 1051-8215, 1558-2205
DOI: 10.1109/TCSVT.2023.3281465