PileNet: A high-and-low pass complementary filter with multi-level feature refinement for salient object detection

Bibliographic details
Published in: Journal of Visual Communication and Image Representation, 2024-06, Vol. 102, p. 104186, Article 104186
Authors: Yang, Xiaoqi; Duan, Liangliang; Zhou, Quanqiang
Format: Article
Language: English
Online access: Full text
Abstract: Multi-head self-attentions (MSAs) in Transformers act as low-pass filters that tend to suppress high-frequency signals, whereas convolutional layers (Convs) in Convolutional Neural Networks (CNNs) act as high-pass filters that tend to capture the high-frequency components of images. CNNs and Transformers therefore carry complementary information, and combining the two is necessary for satisfactory detection results. In this work, we propose PileNet, a novel framework that efficiently combines CNN and Transformer for accurate salient object detection (SOD). Specifically, PileNet introduces a complementary encoder that extracts multi-level complementary saliency features. Next, we simplify the complementary features by adjusting the number of channels of all features to a fixed value. By introducing the multi-level feature aggregation (MLFA) and multi-level feature refinement (MLFR) units, low- and high-level features can easily be transmitted to feature blocks at various pyramid levels. Finally, we fuse all the refined saliency features in a U-Net-like structure from top to bottom and use a multi-point supervision mechanism to produce the final saliency maps. Extensive experiments on five widely used saliency benchmark datasets demonstrate that the proposed model accurately locates entire salient objects with clear object boundaries and outperforms sixteen previous state-of-the-art saliency methods across a wide range of metrics.

Highlights:
• A high-and-low pass complementary filter is used to generate encoders.
• We design an effective multi-level feature refinement unit.
• We design a multi-level feature aggregation unit with shared parameters.
• A multi-point supervision mechanism is proposed to generate saliency maps.
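To make the pipeline described in the abstract concrete, below is a minimal PyTorch sketch of the overall flow: each encoder stage sums a convolutional (high-pass) feature with a self-attention (low-pass) complement, 1x1 convolutions unify every level to a fixed channel count, and a top-down U-Net-like decoder emits one supervised saliency map per pyramid level. All names (ConvStage, AttnStage, unify, fuse, heads) and internals are illustrative assumptions, not the paper's implementation; in particular, the MLFA/MLFR units are only approximated here by simple residual fusion.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvStage(nn.Module):
    # High-pass branch: a plain conv block that halves spatial resolution.
    def __init__(self, c_in, c_out):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(c_in, c_out, 3, stride=2, padding=1),
            nn.BatchNorm2d(c_out),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.block(x)

class AttnStage(nn.Module):
    # Low-pass branch: multi-head self-attention over flattened tokens.
    def __init__(self, dim, heads=4):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):                       # x: (B, C, H, W)
        b, c, h, w = x.shape
        t = x.flatten(2).transpose(1, 2)        # (B, H*W, C)
        n = self.norm(t)
        t = t + self.attn(n, n, n, need_weights=False)[0]
        return t.transpose(1, 2).reshape(b, c, h, w)

class PileNetSketch(nn.Module):
    def __init__(self, widths=(64, 128, 256, 512), mid=64):
        super().__init__()
        chans = (3,) + tuple(widths)
        self.convs = nn.ModuleList(
            ConvStage(chans[i], chans[i + 1]) for i in range(len(widths)))
        self.attns = nn.ModuleList(AttnStage(c) for c in widths)
        # "Simplify": 1x1 convs map every level to a fixed channel count.
        self.unify = nn.ModuleList(nn.Conv2d(c, mid, 1) for c in widths)
        # Top-down fusion plus one supervision head per pyramid level.
        self.fuse = nn.ModuleList(
            nn.Conv2d(mid, mid, 3, padding=1) for _ in widths)
        self.heads = nn.ModuleList(nn.Conv2d(mid, 1, 1) for _ in widths)

    def forward(self, x):
        feats = []
        for conv, attn in zip(self.convs, self.attns):
            x = conv(x)                         # high-pass features
            feats.append(x + attn(x))           # add low-pass complement
        feats = [u(f) for u, f in zip(self.unify, feats)]
        preds, top = [], None
        for i in range(len(feats) - 1, -1, -1):  # coarse to fine
            f = feats[i]
            if top is not None:
                f = f + F.interpolate(top, size=f.shape[-2:],
                                      mode='bilinear', align_corners=False)
            top = self.fuse[i](f)
            preds.append(self.heads[i](top))    # one map per level
        return preds[::-1]                      # finest-resolution map first

maps = PileNetSketch()(torch.randn(1, 3, 64, 64))
print([tuple(m.shape) for m in maps])

Running the sketch on a 64x64 input prints one prediction per level, from the finest (32x32) map down to the coarsest (4x4); under a multi-point supervision scheme like the one the abstract describes, each of these maps would be compared against the ground truth during training.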
ISSN: 1047-3203
eISSN: 1095-9076
DOI: 10.1016/j.jvcir.2024.104186