TransCAM: Transformer attention-based CAM refinement for Weakly supervised semantic segmentation

Bibliographic details
Published in:	Journal of visual communication and image representation 2023-04, Vol.92, p.103800, Article 103800
Authors:	Li, Ruiwen; Mai, Zheda; Zhang, Zhibo; Jang, Jongseong; Sanner, Scott
Format:	Article
Language:	English
Subjects:	Semantic segmentation; Vision transformer; Weakly supervised learning
Online access:	Full text

Description
Summary:	Weakly supervised semantic segmentation (WSSS) with only image-level supervision is a challenging task. Most existing methods exploit Class Activation Maps (CAM) to generate pixel-level pseudo labels for supervised training. However, due to the local receptive field of Convolutional Neural Networks (CNN), CAM applied to CNNs often suffers from partial activation, highlighting the most discriminative part instead of the entire object area. In order to capture both local features and global representations, the Conformer has been proposed to combine a visual transformer branch with a CNN branch. In this paper, we propose TransCAM, a Conformer-based solution to WSSS that explicitly leverages the attention weights from the transformer branch of the Conformer to refine the CAM generated from the CNN branch. TransCAM is motivated by our observation that attention weights from shallow transformer blocks are able to capture low-level spatial feature similarities while attention weights from deep transformer blocks capture high-level semantic context. Despite its simplicity, TransCAM achieves competitive performance of 69.3% and 69.6% on the respective PASCAL VOC 2012 validation and test sets, showing the effectiveness of transformer attention-based refinement of CAM for WSSS.
• We propose a transformer-based solution for Weakly Supervised Semantic Segmentation.
• We utilize the attention weights from the transformer to refine the CAM.
• We find different blocks’ attention weights capture distinct feature affinities.
• Our method is simple yet effective, showing competitive results on PASCAL VOC 2012.
ISSN:	1047-3203 1095-9076
DOI:	10.1016/j.jvcir.2023.103800
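
Illustration:	The summary describes refining a CNN-branch CAM with attention weights from the transformer branch. The following is only a minimal sketch of that idea, not the authors' implementation. It assumes that attention is averaged over all blocks and heads, that the class token sits at index 0, and that refinement is a matrix product between the patch-to-patch attention and the flattened CAM. The function name `refine_cam` and all array shapes are hypothetical.

```python
import numpy as np


def refine_cam(cam, attn):
    """Refine class activation maps with transformer attention (illustrative sketch).

    cam:  array of shape (C, H, W), one activation map per class.
    attn: array of shape (B, heads, N + 1, N + 1) holding attention weights
          from B transformer blocks, where N = H * W patch tokens and the
          extra token at index 0 is assumed to be the class token.
    """
    num_classes, h, w = cam.shape
    n = h * w
    if attn.shape[-2:] != (n + 1, n + 1):
        raise ValueError("attention size must match H * W patch tokens plus one class token")

    # Assumption: shallow and deep blocks are combined by a plain average
    # over the block and head axes.
    mean_attn = attn.mean(axis=(0, 1))   # (N + 1, N + 1)

    # Keep only patch-to-patch affinities by dropping the class token.
    patch_attn = mean_attn[1:, 1:]       # (N, N)

    # Propagate activations: refined[c, i] = sum_j patch_attn[i, j] * cam[c, j].
    flat = cam.reshape(num_classes, n)   # (C, N)
    refined = flat @ patch_attn.T        # (C, N)
    return refined.reshape(num_classes, h, w)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    cam = rng.random((3, 4, 4))              # 3 classes on a 4 x 4 token grid
    attn = rng.random((12, 6, 17, 17))       # 12 blocks, 6 heads, 16 patches + class token
    attn /= attn.sum(axis=-1, keepdims=True) # normalize rows like a softmax output
    print(refine_cam(cam, attn).shape)
```

Under these assumptions, each location's refined activation is an attention-weighted sum of the activations at all locations, so a strongly activated discriminative part can spread to other locations that the attention treats as similar.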