Rethinking RGB-D Salient Object Detection: Models, Data Sets, and Large-Scale Benchmarks

Bibliographic details
Published in:	IEEE Transactions on Neural Networks and Learning Systems, 2021-05, Vol. 32 (5), p. 2075-2089
Main authors:	Fan, Deng-Ping; Lin, Zheng; Zhang, Zhao; Zhu, Menglong; Cheng, Ming-Ming
Format:	Article
Language:	English
Subjects:	Algorithms; Benchmark; Benchmark testing; Benchmarking; Benchmarks; Cameras; Color; Computer Systems; Data models; Datasets; Humans; Image resolution; Learning; Learning systems; Lighting; Machine Learning; Measurement; Neural Networks, Computer; Object recognition; Pattern Recognition, Automated - methods; RGB-D; Salience; saliency; salient object detection (SOD); Salient Person (SIP) data set; Smart phones

Description
Summary:	The use of RGB-D information for salient object detection (SOD) has been extensively explored in recent years. However, relatively few efforts have been put toward modeling SOD in real-world human activity scenes with RGB-D. In this article, we fill the gap by making the following contributions to RGB-D SOD: 1) we carefully collect a new Salient Person (SIP) data set that consists of ~1K high-resolution images that cover diverse real-world scenes from various viewpoints, poses, occlusions, illuminations, and backgrounds; 2) we conduct a large-scale (and, so far, the most comprehensive) benchmark comparing contemporary methods, which has long been missing in the field and can serve as a baseline for future research, and we systematically summarize 32 popular models and evaluate 18 of these 32 models on seven data sets containing a total of about 97k images; and 3) we propose a simple general architecture, called deep depth-depurator network (D3Net). It consists of a depth depurator unit (DDU) and a three-stream feature learning module (FLM), which perform low-quality depth map filtering and cross-modal feature learning, respectively. These components form a nested structure and are elaborately designed to be learned jointly. D3Net exceeds the performance of any prior contenders across all five metrics under consideration, thus serving as a strong model to advance research in this field. We also demonstrate that D3Net can be used to efficiently extract salient object masks from real scenes, enabling an effective background-changing application with a speed of 65 frames/s on a single GPU. All the saliency maps, our new SIP data set, the D3Net model, and the evaluation tools are publicly available at https://github.com/DengPingFan/D3NetBenchmark.
ISSN:	2162-237X 2162-2388
DOI:	10.1109/TNNLS.2020.2996406
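
Illustrative note:	The summary mentions a background-changing application built on the extracted salient object masks but does not say how it is implemented. The sketch below is one plausible reading, assuming standard alpha compositing with the saliency map used as a soft matte. The function name `replace_background`, the optional `threshold` parameter, and the nested-list image format are hypothetical choices made for this example and are not taken from the D3Net code base.

```python
from typing import List, Optional, Tuple

Pixel = Tuple[int, int, int]   # (R, G, B), each channel in 0..255
Image = List[List[Pixel]]      # rows of pixels
Mask = List[List[float]]       # saliency in [0, 1], same size as the image


def replace_background(image: Image, saliency: Mask, background: Image,
                       threshold: Optional[float] = None) -> Image:
    """Composite the salient foreground of `image` onto `background`.

    Assumption: plain alpha compositing, out = a * fg + (1 - a) * bg.
    With threshold=None the saliency value is used directly as alpha;
    otherwise the map is binarized at `threshold` first.
    """
    if not (len(image) == len(saliency) == len(background)):
        raise ValueError("image, saliency and background must have equal height")

    out: Image = []
    for img_row, sal_row, bg_row in zip(image, saliency, background):
        if not (len(img_row) == len(sal_row) == len(bg_row)):
            raise ValueError("image, saliency and background must have equal width")
        out_row: List[Pixel] = []
        for fg, s, bg in zip(img_row, sal_row, bg_row):
            alpha = min(max(s, 0.0), 1.0)          # clamp saliency to [0, 1]
            if threshold is not None:
                alpha = 1.0 if alpha >= threshold else 0.0
            r, g, b = (round(alpha * f + (1.0 - alpha) * k) for f, k in zip(fg, bg))
            out_row.append((r, g, b))
        out.append(out_row)
    return out


# Tiny 1x2 example: the left pixel is fully salient, the right one is not.
photo = [[(200, 100, 0), (200, 100, 0)]]
mask = [[1.0, 0.0]]
new_bg = [[(0, 0, 50), (0, 0, 50)]]
result = replace_background(photo, mask, new_bg)
```

In this example the left pixel keeps its original color and the right pixel takes the color of the new background. A real pipeline would more likely operate on array types and a predicted saliency map; the pure-Python lists are used here only to keep the sketch dependency-free.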