Complementary branch fusing class and semantic knowledge for robust weakly supervised semantic segmentation
Published in: Pattern Recognition 2025-01, Vol. 157, p. 110922, Article 110922
Format: Article
Language: English
Online access: Full text
Abstract: Leveraging semantically precise pseudo masks derived from image-level class knowledge for segmentation, namely image-level Weakly Supervised Semantic Segmentation (WSSS), remains challenging. Class Activation Maps (CAMs) from CNNs enhance WSSS but tend to focus on discriminative class parts, such as only the face of a person, whereas Vision Transformers (ViTs) capture broader semantic regions yet often miss class-specific details, for example merging a human body with nearby objects such as dogs. In this work, we propose the Complementary Branch (CoBra), a novel dual-branch framework consisting of two distinct architectures that provide valuable complementary knowledge of class (from the CNN) and semantics (from the ViT). In particular, we learn a Class-Aware Projection (CAP) for the CNN branch and a Semantic-Aware Projection (SAP) for the ViT branch, combining their insights to enable new patch-level supervision and to create effective pseudo masks that integrate class and semantic information. Extensive experiments qualitatively and quantitatively investigate how each branch complements the other, showing significant results. Project page and code are available at: https://micv-yonsei.github.io/cobra2024/.
• We propose the Complementary Branch framework, CoBra, which leverages and maximizes the strengths of both CNN and ViT while complementing each other's limitations.
• We utilize Class-Aware Projection (CAP) and Semantic-Aware Projection (SAP) to capture class and semantic information, respectively, providing enhanced and complementary guidance to the CNN and ViT branches through contrastive learning.
• We evaluate our model's image-level WSSS performance on the PASCAL VOC 2012 and MS-COCO 2014 datasets, showing significant results for seed, mask, and segmentation quality.
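The complementary behavior the abstract describes, a class-focused CNN CAM and a broader ViT semantic map combined into a pseudo mask, can be illustrated with a minimal NumPy sketch. The fusion rule (geometric mean plus threshold), the function names, and the toy arrays below are illustrative assumptions for intuition, not the paper's actual CAP/SAP formulation:

```python
import numpy as np

def normalize(m):
    """Min-max normalize a map to [0, 1]."""
    m = m - m.min()
    return m / (m.max() + 1e-8)

def fuse_pseudo_mask(cnn_cam, vit_map, threshold=0.4):
    """Illustrative fusion of a class-discriminative CNN CAM with a
    broader ViT semantic map. The geometric mean is high only where
    BOTH branches agree, so class-irrelevant semantic regions (e.g. a
    nearby dog) and under-activated class parts are suppressed.
    This is a simplified stand-in for CoBra's learned fusion."""
    c = normalize(cnn_cam)
    s = normalize(vit_map)
    fused = np.sqrt(c * s)
    return (fused >= threshold).astype(np.uint8)

# Toy 4x4 example: the CNN activates only the object center; the ViT
# map also spills onto a neighboring region (index [2, 3]).
cnn_cam = np.array([[0, 0, 0, 0],
                    [0, 9, 9, 0],
                    [0, 9, 9, 0],
                    [0, 0, 0, 0]], dtype=float)
vit_map = np.array([[1, 1, 1, 1],
                    [1, 8, 8, 1],
                    [1, 8, 8, 8],
                    [1, 1, 1, 1]], dtype=float)
mask = fuse_pseudo_mask(cnn_cam, vit_map)
```

In this toy case the fused mask keeps the 2x2 center where both maps agree and rejects the ViT-only spillover cell, mirroring how the two branches are meant to correct each other's failure modes.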
ISSN: 0031-3203
DOI: 10.1016/j.patcog.2024.110922