DVIS++: Improved Decoupled Framework for Universal Video Segmentation
Main authors: | , , , , , , , , , |
---|---|
Format: | Article |
Language: | English |
Subjects: | |
Online access: | Order full text |
Summary: | We present the Decoupled VIdeo Segmentation (DVIS)
framework, a novel approach for the challenging task of universal video
segmentation, including video instance segmentation (VIS), video semantic
segmentation (VSS), and video panoptic segmentation (VPS). Unlike previous
methods that model video segmentation in an end-to-end manner, our approach
decouples video segmentation into three cascaded sub-tasks: segmentation,
tracking, and refinement. This decoupling design allows for simpler and more
effective modeling of the spatio-temporal representations of objects,
especially in complex scenes and long videos. Accordingly, we introduce two
novel components: the referring tracker and the temporal refiner. These
components track objects frame by frame and model spatio-temporal
representations based on pre-aligned features. To improve the tracking
capability of DVIS, we propose a denoising training strategy and introduce
contrastive learning, resulting in a more robust framework named DVIS++.
Furthermore, we evaluate DVIS++ in various settings, including the open-vocabulary
setting and training with a frozen pre-trained backbone. By integrating CLIP with DVIS++, we
present OV-DVIS++, the first open-vocabulary universal video segmentation
framework. We conduct extensive experiments on six mainstream benchmarks,
including the VIS, VSS, and VPS datasets. Using a unified architecture, DVIS++
significantly outperforms state-of-the-art specialized methods on these
benchmarks in both closed- and open-vocabulary settings.
Code: https://github.com/zhang-tao-whu/DVIS_Plus |
---|---|
DOI: | 10.48550/arxiv.2312.13305 |
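
The summary describes a cascade of three sub-tasks: segmentation, tracking, and refinement. The following is a minimal sketch of that data flow only, not the authors' implementation. All function names are hypothetical, each "frame" is assumed to already be a list of per-object feature vectors, every frame is assumed to contain the same objects, the learned referring tracker is replaced by exhaustive similarity matching, and the learned temporal refiner is replaced by a sliding-window mean over the aligned features.

```python
from itertools import permutations


def dot(a, b):
    """Plain dot product, used here as a similarity score."""
    return sum(x * y for x, y in zip(a, b))


def segment(frame):
    # Stage 1 (stand-in): a real segmenter would predict masks and object
    # features from pixels. Here the frame is assumed to already hold one
    # feature vector per object, in arbitrary order.
    return [list(v) for v in frame]


def track(prev, curr):
    # Stage 2 (stand-in for the referring tracker): reorder `curr` so that
    # slot i holds the object most similar to slot i of the previous frame.
    # Exhaustive search over permutations; only sensible for a few objects.
    best = max(
        permutations(range(len(curr))),
        key=lambda p: sum(dot(prev[i], curr[j]) for i, j in enumerate(p)),
    )
    return [curr[j] for j in best]


def refine(aligned, window=1):
    # Stage 3 (stand-in for the temporal refiner): average each object slot
    # over neighbouring frames, which is only valid because the features
    # were aligned by the tracking stage first.
    n_frames = len(aligned)
    out = []
    for t in range(n_frames):
        lo, hi = max(0, t - window), min(n_frames, t + window + 1)
        frame = []
        for k in range(len(aligned[t])):
            dim = len(aligned[t][k])
            frame.append(
                [sum(aligned[s][k][d] for s in range(lo, hi)) / (hi - lo)
                 for d in range(dim)]
            )
        out.append(frame)
    return out


def run_pipeline(video):
    # Segmentation runs per frame; tracking runs frame by frame against the
    # previously aligned frame; refinement sees the whole aligned sequence.
    per_frame = [segment(f) for f in video]
    aligned = [per_frame[0]]
    for objs in per_frame[1:]:
        aligned.append(track(aligned[-1], objs))
    return aligned, refine(aligned)


# Toy video: two objects whose order is swapped in the later frames.
video = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.1, 0.9], [0.9, 0.1]],
    [[0.0, 1.0], [1.0, 0.0]],
]
aligned, refined = run_pipeline(video)
```

In this toy input the tracking stage restores a consistent object order across frames, which is what lets the refinement stage average features per object rather than per arbitrary slot.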