Complementary branch fusing class and semantic knowledge for robust weakly supervised semantic segmentation


Bibliographic Details
Published in: Pattern Recognition, 2025-01, Vol. 157, p. 110922, Article 110922
Main Authors: Han, Woojung; Kang, Seil; Choo, Kyobin; Hwang, Seong Jae
Format: Article
Language: English
Online Access: Full text
Description
Summary: Leveraging semantically precise pseudo masks derived from image-level class knowledge for segmentation, namely image-level Weakly Supervised Semantic Segmentation (WSSS), remains challenging. Class Activation Maps (CAMs) from CNNs enhance WSSS but tend to focus on discriminative class parts, such as only the face of a human, whereas Vision Transformers (ViTs) capture broader semantic regions but often miss complete class-specific details, such as human bodies with nearby objects like dogs. In this work, we propose the Complementary Branch (CoBra), a novel dual-branch framework consisting of two distinct architectures that provide complementary knowledge of class (from the CNN) and semantics (from the ViT). In particular, we learn a Class-Aware Projection (CAP) for the CNN branch and a Semantic-Aware Projection (SAP) for the ViT branch, combining their insights to enable new patch-level supervision and to create effective pseudo masks that integrate class and semantic information. Extensive experiments qualitatively and quantitatively investigate how each branch complements the other, showing significant results. Project page and code are available at: https://micv-yonsei.github.io/cobra2024/.

Highlights:
• We propose the Complementary Branch framework, CoBra, which leverages and maximizes the strengths of both CNN and ViT while complementing each other's limitations.
• We utilize Class-Aware Projection (CAP) and Semantic-Aware Projection (SAP) to capture class and semantic information, respectively, providing enhanced and complementary guidance to the CNN and ViT branches through contrastive learning.
• We evaluate our model's image-level WSSS performance on the PASCAL VOC 2012 and MS-COCO 2014 datasets, showing significant improvements in seed, mask, and segmentation results.
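To illustrate the core idea of combining a class-aware CNN map with a semantic-aware ViT map into one pseudo mask, the sketch below shows a deliberately simplified fusion step. It is not the paper's method: CoBra learns CAP and SAP with contrastive learning and patch-level supervision, whereas this toy version (function name `fuse_pseudo_mask`, weight `alpha`, and threshold `bg_thresh` are all hypothetical) just blends two per-class score maps and thresholds the background.

```python
import numpy as np

def fuse_pseudo_mask(cam_cnn, sim_vit, alpha=0.5, bg_thresh=0.3):
    """Blend a class-aware CNN score map with a semantic-aware ViT score map.

    cam_cnn, sim_vit: (C, H, W) per-class score maps in [0, 1].
    Returns an (H, W) pseudo mask with 0 = background, 1..C = classes.
    """
    # Weighted combination of the two branches' evidence.
    fused = alpha * cam_cnn + (1.0 - alpha) * sim_vit  # (C, H, W)
    # Highest class score per pixel decides foreground confidence.
    fg_score = fused.max(axis=0)                       # (H, W)
    # Most likely class per pixel, shifted so 0 stays free for background.
    mask = fused.argmax(axis=0) + 1                    # (H, W)
    # Low-confidence pixels fall back to background.
    mask[fg_score < bg_thresh] = 0
    return mask

# Toy example: class 0 activates on the top half of a 4x4 image.
cam = np.zeros((2, 4, 4)); cam[0, :2] = 0.9   # CNN: strong on a small part
sim = np.zeros((2, 4, 4)); sim[0, :2] = 0.8   # ViT: broad semantic support
print(fuse_pseudo_mask(cam, sim))
```

In the real framework the fusion is learned rather than a fixed weighted average, but the data flow is the same: two complementary per-class maps in, one pseudo mask out.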
ISSN:0031-3203
DOI:10.1016/j.patcog.2024.110922