LMSeg: Unleashing the Power of Large-Scale Models for Open-Vocabulary Semantic Segmentation
Format: | Article |
Language: | English |
Abstract: | It is widely agreed that open-vocabulary approaches outperform
classical closed-set training for recognizing unseen objects in semantic
segmentation. Existing open-vocabulary approaches leverage
vision-language models, such as CLIP, to align visual features with rich
semantic features acquired through pre-training on large-scale vision-language
datasets. However, the text prompts employed in these methods are short phrases
based on fixed templates, failing to capture comprehensive object attributes.
Moreover, while the CLIP model excels at exploiting image-level features, it is
less effective at pixel-level representation, which is crucial for semantic
segmentation tasks. In this work, we propose to alleviate the above-mentioned
issues by leveraging multiple large-scale models to enhance the alignment
between fine-grained visual features and enriched linguistic features.
Specifically, our method employs large language models (LLMs) to generate
enriched language prompts with diverse visual attributes for each category,
including color, shape/size, and texture/material. Additionally, for enhanced
visual feature extraction, the SAM model is adopted as a supplement to the CLIP
visual encoder through a proposed learnable weighted fusion strategy. Built
upon these techniques, our method, termed LMSeg, achieves state-of-the-art
performance across all major open-vocabulary segmentation benchmarks. The code
will be made available soon. |
DOI: | 10.48550/arxiv.2412.00364 |
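The learnable weighted fusion of CLIP and SAM visual features described in the abstract could look roughly like the sketch below. This is a minimal PyTorch illustration under stated assumptions: the module name `LearnableWeightedFusion`, the 1x1 projection layers, the feature dimensions, and the softmax-normalized per-source weights are illustrative choices, not the authors' released implementation.

```python
# Hypothetical sketch of a learnable weighted fusion of CLIP and SAM
# visual features. Names, projection layers, and dimensions are
# assumptions for illustration only.
import torch
import torch.nn as nn


class LearnableWeightedFusion(nn.Module):
    """Fuse patch-level features from a CLIP visual encoder and SAM."""

    def __init__(self, clip_dim: int, sam_dim: int, out_dim: int):
        super().__init__()
        # Project both feature maps into a shared embedding space.
        self.clip_proj = nn.Conv2d(clip_dim, out_dim, kernel_size=1)
        self.sam_proj = nn.Conv2d(sam_dim, out_dim, kernel_size=1)
        # One learnable scalar weight per source, softmax-normalized so
        # the two contributions always sum to 1.
        self.weights = nn.Parameter(torch.zeros(2))

    def forward(self, clip_feat: torch.Tensor, sam_feat: torch.Tensor) -> torch.Tensor:
        # clip_feat: (B, clip_dim, H, W), sam_feat: (B, sam_dim, H, W)
        c = self.clip_proj(clip_feat)
        s = self.sam_proj(sam_feat)
        w = torch.softmax(self.weights, dim=0)
        return w[0] * c + w[1] * s


# Example usage with dummy, spatially aligned feature maps.
fusion = LearnableWeightedFusion(clip_dim=768, sam_dim=256, out_dim=512)
clip_feat = torch.randn(1, 768, 32, 32)  # e.g. reshaped CLIP patch tokens
sam_feat = torch.randn(1, 256, 32, 32)   # e.g. SAM image-encoder features
fused = fusion(clip_feat, sam_feat)      # (1, 512, 32, 32)
```

Softmax-normalizing the two weights keeps the contributions on a comparable scale, letting training decide how much pixel-level detail from SAM to blend into CLIP's semantically aligned features.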