Transformer Fusion and Pixel-level Contrastive Learning for RGB-D Salient Object Detection


Bibliographic Details
Published in: IEEE Transactions on Multimedia, 2024-01, Vol. 26, pp. 1-16
Authors: Wu, Jiesheng; Hao, Fangwei; Liang, Weiyun; Xu, Jing
Format: Article
Language: English
Description

Abstract: Current RGB-D salient object detection (RGB-D SOD) methods mainly develop a generalizable model trained with binary cross-entropy (BCE) loss on convolutional or Transformer backbones. However, they usually rely on convolutional modules to fuse multi-modality features, paying little attention to the long-range multi-modality interactions needed for feature fusion. Furthermore, BCE loss does not explicitly model intra- and inter-pixel relationships in a joint embedding space. To address these issues, we propose a cross-modality interaction parallel-transformer (CIPT) module, which better captures long-range multi-modality interactions and generates more comprehensive fusion features. In addition, we propose a pixel-level contrastive learning (PCL) method that improves inter-pixel discrimination and intra-pixel compactness, yielding a well-structured embedding space and a better saliency detector. Specifically, we propose an asymmetric network (TPCL) for RGB-D SOD, consisting of a Swin V2 Transformer-based backbone and a lightweight backbone of our own design (LDNet). Moreover, an edge-guided module and a feature enhancement (FE) module are proposed to refine the learned fusion features. Extensive experiments demonstrate that our method performs favorably against 15 state-of-the-art methods on seven public datasets. We expect our work to facilitate the exploration of Transformers and contrastive learning for RGB-D SOD tasks. Our code and predicted saliency maps will be released at https://github.com/TomorrowJW/TPCL_RGBDSOD.
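The abstract names two core techniques without detailing them; the paper's exact CIPT and PCL formulations are not reproduced in this record, so the two PyTorch sketches below are illustrative only. Every module and function name (CrossModalAttention, pixel_contrastive_loss) and every hyperparameter (num_heads, temperature, max_pixels) is a hypothetical stand-in, not the authors' released code. The first sketch shows the general idea behind transformer-based multi-modality fusion: bidirectional cross-attention lets every RGB token attend to every depth token (and vice versa), capturing the long-range cross-modal interactions that local convolutional fusion misses.

import torch
import torch.nn as nn

class CrossModalAttention(nn.Module):
    """Bidirectional cross-attention fusing RGB and depth feature maps.

    Hypothetical sketch of transformer-based fusion; not the paper's
    actual CIPT module.
    """
    def __init__(self, dim, num_heads=4):
        super().__init__()
        self.rgb_from_depth = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.depth_from_rgb = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, rgb, depth):
        # rgb, depth: (N, C, H, W) feature maps from the two backbones
        n, c, h, w = rgb.shape
        r = rgb.flatten(2).transpose(1, 2)    # (N, H*W, C) token sequence
        d = depth.flatten(2).transpose(1, 2)
        # Each modality queries the other, so every output token can
        # attend to all spatial positions of the other modality --
        # the long-range interaction a local conv fusion cannot capture.
        r_attn, _ = self.rgb_from_depth(r, d, d)
        d_attn, _ = self.depth_from_rgb(d, r, r)
        fused = self.fuse(torch.cat([r + r_attn, d + d_attn], dim=-1))
        return fused.transpose(1, 2).reshape(n, c, h, w)

The second sketch shows a pixel-level supervised contrastive loss in the spirit of PCL: unlike plain BCE, which scores each pixel independently, it shapes a joint embedding space by pulling same-class pixels together (intra-pixel compactness) and pushing different-class pixels apart (inter-pixel discrimination).

import torch
import torch.nn.functional as F

def pixel_contrastive_loss(embeddings, labels, temperature=0.1, max_pixels=1024):
    """Supervised InfoNCE-style contrastive loss over individual pixels.

    embeddings: (N, C, H, W) per-pixel embedding vectors
    labels:     (N, 1, H, W) binary saliency ground truth (0/1)
    """
    n, c, h, w = embeddings.shape
    feats = embeddings.permute(0, 2, 3, 1).reshape(-1, c)  # (N*H*W, C)
    lbls = labels.reshape(-1)

    # Subsample pixels so the P x P similarity matrix stays tractable.
    idx = torch.randperm(feats.size(0), device=feats.device)[:max_pixels]
    feats = F.normalize(feats[idx], dim=1)
    lbls = lbls[idx]

    sim = feats @ feats.t() / temperature                  # (P, P) cosine logits
    pos_mask = (lbls[:, None] == lbls[None, :]).float()
    pos_mask.fill_diagonal_(0)                             # drop self-pairs
    logit_mask = 1.0 - torch.eye(sim.size(0), device=sim.device)

    # For each anchor pixel: -log p(positive | all other pixels),
    # averaged over its positives. Same-class pixels are pulled
    # together; different-class pixels are pushed apart.
    exp_sim = torch.exp(sim) * logit_mask
    log_prob = sim - torch.log(exp_sim.sum(dim=1, keepdim=True) + 1e-8)
    mean_log_prob_pos = (pos_mask * log_prob).sum(1) / pos_mask.sum(1).clamp(min=1)
    return -mean_log_prob_pos.mean()

In a training loop of this kind, the contrastive term would typically be added to the standard BCE loss on the predicted saliency map, e.g. loss = bce + 0.1 * pcl, with the weight tuned on a validation set.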
ISSN: 1520-9210 (print), 1941-0077 (electronic)
DOI: 10.1109/TMM.2023.3275308