Towards Open-Vocabulary Video Instance Segmentation
Format: Article
Language: English
Abstract: Video Instance Segmentation (VIS) aims at segmenting and categorizing objects in videos from a closed set of training categories, and therefore lacks the generalization ability to handle novel categories in real-world videos. To address this limitation, we make the following three contributions. First, we introduce the novel task of Open-Vocabulary Video Instance Segmentation, which aims to simultaneously segment, track, and classify objects in videos from open-set categories, including novel categories unseen during training. Second, to benchmark Open-Vocabulary VIS, we collect a Large-Vocabulary Video Instance Segmentation dataset (LV-VIS) that contains well-annotated objects from 1,196 diverse categories, surpassing the category count of existing datasets by more than an order of magnitude. Third, we propose an efficient Memory-Induced Transformer architecture, OV2Seg, which is the first to achieve Open-Vocabulary VIS in an end-to-end manner with near real-time inference speed. Extensive experiments on LV-VIS and four existing VIS datasets demonstrate the strong zero-shot generalization ability of OV2Seg on novel categories. The dataset and code are released at https://github.com/haochenheheda/LVVIS.
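As an illustration of the open-vocabulary classification step this task requires, the sketch below matches one instance embedding against text embeddings of an arbitrary, possibly novel, list of category names via cosine similarity. This mirrors the common open-vocabulary recipe (visual embeddings scored against text-encoder embeddings of category names) but is only an assumption here; the abstract does not describe OV2Seg's internals, and the helper name, embedding dimension, and random placeholder embeddings below are hypothetical.

import numpy as np

def classify_instance(instance_emb, text_embs, category_names):
    """Return the best-matching category name and its cosine similarity.

    instance_emb: (D,) embedding of one segmented/tracked instance.
    text_embs: (N, D) embeddings of N arbitrary category names.
    In practice both would come from a vision backbone and a text
    encoder (e.g. a CLIP-style model); here they are placeholders.
    """
    inst = instance_emb / np.linalg.norm(instance_emb)
    txt = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    sims = txt @ inst                      # cosine similarity per category
    best = int(np.argmax(sims))
    return category_names[best], float(sims[best])

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    categories = ["zebra", "skateboard", "hermit crab"]   # open-set vocabulary
    text_embs = rng.normal(size=(len(categories), 512))   # placeholder text embeddings
    instance_emb = rng.normal(size=512)                    # placeholder instance embedding
    name, score = classify_instance(instance_emb, text_embs, categories)
    print(f"predicted category: {name} (similarity {score:.3f})")

Because the category list is supplied at inference time rather than fixed at training time, the same scoring step handles novel categories unseen during training, which is what distinguishes Open-Vocabulary VIS from closed-set VIS.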
DOI: 10.48550/arxiv.2304.01715