Prototype as query for few shot semantic segmentation
Saved in:
Published in: | Complex & Intelligent Systems 2024-10, Vol. 10 (5), p. 7265-7278 |
Main authors: | , , , |
Format: | Article |
Language: | English |
Subjects: | |
Online access: | Full text |
Abstract: | Few-shot Semantic Segmentation (FSS) was proposed to segment unseen classes in a query image, referring to only a few annotated examples, named support images. One of the characteristics of FSS is the spatial inconsistency between query and support targets, e.g., in texture or appearance. This greatly challenges the generalization ability of FSS methods, which must effectively exploit the dependency between the query image and the support examples. Most existing methods abstracted support features into prototype vectors and implemented the interaction with query features using cosine similarity or feature concatenation. However, this simple interaction may not capture spatial details in query features. To address this limitation, some methods utilized pixel-level support information by computing pixel-level correlations between paired query and support features with the attention mechanism of the Transformer. Nevertheless, these approaches suffer from heavy computation due to dot-product attention between all pixels of the support and query features. In this paper, we propose a novel framework, termed ProtoFormer, built upon the Transformer architecture, to fully capture spatial details in query features. ProtoFormer treats the abstracted prototype of the target class in the support features as the Query and the query features as the Key and Value embeddings, which are input to the Transformer decoder. This approach better captures spatial details and focuses on the semantic features of the target class in the query image. The output of the Transformer-based module can be interpreted as semantic-aware dynamic kernels that filter the segmentation mask from the enriched query features. Extensive experiments conducted on the PASCAL-5^i and COCO-20^i datasets demonstrate that ProtoFormer significantly outperforms the state-of-the-art methods in FSS. |
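The abstract describes the prototype-as-Query mechanism only at a high level. Below is a minimal, illustrative PyTorch sketch of that idea, not the authors' implementation: the masked-average-pooling prototype, the single standard nn.TransformerDecoderLayer, the feature dimension of 256, and the final 1x1 filtering step are all assumptions chosen to make the sketch self-contained.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ProtoFormerSketch(nn.Module):
    """Illustrative sketch of the prototype-as-Query idea (not the paper's code)."""

    def __init__(self, dim=256, num_heads=8):
        super().__init__()
        # One standard decoder layer stands in for the Transformer-based module:
        # its cross-attention uses the prototype as Query and the query features as Key/Value.
        self.decoder = nn.TransformerDecoderLayer(d_model=dim, nhead=num_heads, batch_first=True)

    @staticmethod
    def masked_average_pooling(support_feat, support_mask):
        # support_feat: (B, C, H, W); support_mask: (B, 1, H, W) binary mask of the target class.
        mask = F.interpolate(support_mask.float(), size=support_feat.shape[-2:],
                             mode="bilinear", align_corners=False)
        proto = (support_feat * mask).sum(dim=(2, 3)) / (mask.sum(dim=(2, 3)) + 1e-6)
        return proto  # (B, C) class prototype abstracted from the support features

    def forward(self, query_feat, support_feat, support_mask):
        proto = self.masked_average_pooling(support_feat, support_mask)  # (B, C)
        tgt = proto.unsqueeze(1)                       # (B, 1, C)  -> Query token
        mem = query_feat.flatten(2).transpose(1, 2)    # (B, HW, C) -> Key/Value embeddings
        kernel = self.decoder(tgt, mem)                # (B, 1, C)  semantic-aware dynamic kernel
        # Filter the segmentation logits from the query features with the dynamic kernel (1x1 conv).
        logits = torch.einsum("bkc,bchw->bkhw", kernel, query_feat)  # (B, 1, H, W)
        return logits


# Toy usage with random tensors; shapes and dimensions are arbitrary.
model = ProtoFormerSketch(dim=256, num_heads=8)
q = torch.randn(2, 256, 32, 32)                 # query-image features
s = torch.randn(2, 256, 32, 32)                 # support-image features
m = (torch.rand(2, 1, 32, 32) > 0.5).float()    # binary support mask
print(model(q, s, m).shape)                     # torch.Size([2, 1, 32, 32])
```

In this sketch the decoder's cross-attention lets a single prototype token attend over every spatial position of the query features, so the attention cost grows linearly with the number of query pixels rather than with all query-support pixel pairs, which is one way to read the efficiency point the abstract raises.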
ISSN: | 2199-4536; 2198-6053 |
DOI: | 10.1007/s40747-024-01539-4 |