Zero-shot Referring Image Segmentation with Global-Local Context Features
Main Authors: , ,
Format: Article
Language: English
Keywords:
Online Access: Order full text
Abstract: Referring image segmentation (RIS) aims to find a segmentation mask given a referring expression grounded to a region of the input image. Collecting labelled datasets for this task, however, is notoriously costly and labor-intensive. To overcome this issue, we propose a simple yet effective zero-shot referring image segmentation method by leveraging the pre-trained cross-modal knowledge from CLIP. In order to obtain segmentation masks grounded to the input text, we propose a mask-guided visual encoder that captures global and local contextual information of an input image. By utilizing instance masks obtained from off-the-shelf mask proposal techniques, our method is able to segment fine-detailed instance-level groundings. We also introduce a global-local text encoder where the global feature captures complex sentence-level semantics of the entire input expression while the local feature focuses on the target noun phrase extracted by a dependency parser. In our experiments, the proposed method outperforms several zero-shot baselines of the task and even the weakly supervised referring expression segmentation method by substantial margins. Our code is available at https://github.com/Seonghoon-Yu/Zero-shot-RIS.
DOI: 10.48550/arxiv.2303.17811
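The abstract describes the method only at a high level. The sketch below is a minimal, hypothetical Python rendition of such a global-local, CLIP-based zero-shot scoring loop, not the authors' implementation (their code is at the GitHub link above). The mixing weight `ALPHA`, the noun-chunk heuristic standing in for the paper's dependency-parser step, and the crop-and-mask "local" view standing in for the mask-guided visual encoder are all assumptions; instance masks are assumed to be pre-computed binary arrays from any off-the-shelf proposal method.

```python
# Hypothetical sketch of a global-local, CLIP-based zero-shot RIS scoring loop.
# Assumptions: OpenAI's `clip` package, spaCy for the parse, non-empty binary
# instance masks (H x W) supplied externally, and RGB PIL images.
import numpy as np
import torch
import clip
import spacy
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)
nlp = spacy.load("en_core_web_sm")

ALPHA = 0.5  # assumed weight for mixing global and local features


def target_noun_phrase(expression: str) -> str:
    """Pick the noun chunk attached to the syntactic root, falling back to the
    first chunk or the whole expression (a simplification of the parser step)."""
    doc = nlp(expression)
    chunks = list(doc.noun_chunks)
    if not chunks:
        return expression
    for chunk in chunks:
        if chunk.root.dep_ == "ROOT" or chunk.root.head.dep_ == "ROOT":
            return chunk.text
    return chunks[0].text


@torch.no_grad()
def encode_text_global_local(expression: str) -> torch.Tensor:
    """Global feature from the full sentence, local feature from the noun phrase."""
    phrase = target_noun_phrase(expression)
    tokens = clip.tokenize([expression, phrase]).to(device)
    feats = model.encode_text(tokens).float()
    feats = feats / feats.norm(dim=-1, keepdim=True)
    return ALPHA * feats[0] + (1 - ALPHA) * feats[1]


@torch.no_grad()
def encode_mask_global_local(image: Image.Image, mask: np.ndarray) -> torch.Tensor:
    """Global feature from the full image, local feature from the masked box crop."""
    ys, xs = np.nonzero(mask)
    box = (xs.min(), ys.min(), xs.max() + 1, ys.max() + 1)
    masked = np.array(image) * mask[..., None]          # zero out the background
    local_view = Image.fromarray(masked.astype(np.uint8)).crop(box)
    batch = torch.stack([preprocess(image), preprocess(local_view)]).to(device)
    feats = model.encode_image(batch).float()
    feats = feats / feats.norm(dim=-1, keepdim=True)
    return ALPHA * feats[0] + (1 - ALPHA) * feats[1]


@torch.no_grad()
def select_mask(image: Image.Image, masks: list[np.ndarray], expression: str) -> np.ndarray:
    """Return the instance mask whose visual feature best matches the text feature."""
    text_feat = encode_text_global_local(expression)
    scores = [torch.cosine_similarity(encode_mask_global_local(image, m), text_feat, dim=0)
              for m in masks]
    return masks[int(torch.stack(scores).argmax())]
```

In this sketch the only supervision is CLIP's pre-trained image-text alignment: picking the best-scoring proposal requires no RIS-specific training, which is what makes the approach zero-shot. Per the abstract, the paper's mask-guided visual encoder captures global and local image context inside the encoder itself; the naive masked crop used here only approximates that.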