CoBra: Complementary Branch Fusing Class and Semantic Knowledge for Robust Weakly Supervised Semantic Segmentation
Main authors: | , , , |
---|---|
Format: | Article |
Language: | eng |
Subjects: | |
Online access: | Order full text |
Summary: | Leveraging semantically precise pseudo masks derived from image-level class
knowledge for segmentation, namely image-level Weakly Supervised Semantic
Segmentation (WSSS), remains challenging. While Class Activation Maps
(CAMs) using CNNs have steadily been contributing to the success of WSSS, the
resulting activation maps often narrowly focus on class-specific parts (e.g.,
only the face of a human). On the other hand, recent works based on vision
transformers (ViT) have shown promising results based on their self-attention
mechanism to capture the semantic parts but fail in capturing complete
class-specific details (e.g., the entire body of a human, but also a dog
nearby). In this work, we propose Complementary Branch (CoBra), a novel dual
branch framework consisting of two distinct architectures which provide
valuable complementary knowledge of class (from CNN) and semantic (from ViT) to
each branch. In particular, we learn Class-Aware Projection (CAP) for the CNN
branch and Semantic-Aware Projection (SAP) for the ViT branch to explicitly
fuse their complementary knowledge and facilitate a new type of extra
patch-level supervision. Our model, through CoBra, fuses CNN and ViT's
complementary outputs to create robust pseudo masks that integrate both class
and semantic information effectively. Extensive experiments qualitatively and
quantitatively investigate how CNN and ViT complement each other on the PASCAL
VOC 2012 dataset, showing a state-of-the-art WSSS result. This includes not
only the masks generated by our model, but also the segmentation results
derived from utilizing these masks as pseudo labels. |
---|---|
DOI: | 10.48550/arxiv.2403.08801 |
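
Note on Class Activation Maps: the summary refers to CAMs without spelling out how they are computed. The sketch below illustrates the standard CAM idea only (a class-weighted sum of the final convolutional feature maps, followed by ReLU, max-normalization, and a threshold to obtain a binary pseudo mask). It is not the CoBra method, and the function names, the toy 2x2 feature maps, and the 0.5 threshold are all assumptions chosen for illustration.

```python
def class_activation_map(features, class_weights):
    """Standard CAM sketch: weighted sum of K feature maps (each H x W).

    `features` is a list of K maps, `class_weights` the K classifier
    weights for one class. Both are hypothetical toy inputs here.
    """
    height, width = len(features[0]), len(features[0][0])
    cam = [[0.0] * width for _ in range(height)]
    for weight, fmap in zip(class_weights, features):
        for y in range(height):
            for x in range(width):
                cam[y][x] += weight * fmap[y][x]
    # ReLU, then scale so the strongest activation equals 1.
    cam = [[max(v, 0.0) for v in row] for row in cam]
    peak = max(max(row) for row in cam)
    if peak > 0:
        cam = [[v / peak for v in row] for row in cam]
    return cam


def pseudo_mask(cam, threshold=0.5):
    """Binarize a normalized CAM; the threshold value is an assumption."""
    return [[1 if v >= threshold else 0 for v in row] for row in cam]


# Toy example: two 2x2 feature maps, one weighted much more strongly.
features = [
    [[1.0, 0.0], [0.0, 0.0]],
    [[0.0, 0.0], [0.0, 2.0]],
]
weights = [1.0, 0.2]

cam = class_activation_map(features, weights)
mask = pseudo_mask(cam)
```

In this toy case only the top-left cell survives the threshold even though the bottom-right cell also responds to the class, which mirrors the narrow focus on the most discriminative part that the summary attributes to CNN-based CAMs.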