Efficient Video Instance Segmentation via Tracklet Query and Proposal
Main authors: | , , , , , , |
Format: | Article |
Language: | English |
Subjects: | |
Online access: | Order full text |
Abstract: | Video Instance Segmentation (VIS) aims to simultaneously classify, segment, and track multiple object instances in videos. Recent clip-level VIS takes a short video clip as input each time, showing stronger performance than frame-level VIS (tracking-by-segmentation) because more temporal context from multiple frames is utilized. Yet most clip-level methods are neither end-to-end learnable nor real-time. These limitations are addressed by the recent VIS transformer (VisTR), which performs VIS end-to-end within a clip. However, VisTR suffers from long training time due to its frame-wise dense attention. In addition, VisTR is not fully end-to-end learnable across multiple video clips, as it requires hand-crafted data association to link instance tracklets between successive clips. This paper proposes EfficientVIS, a fully end-to-end framework with efficient training and inference. At its core are tracklet queries and tracklet proposals that associate and segment regions-of-interest (RoIs) across space and time via an iterative query-video interaction. We further propose a correspondence learning scheme that makes tracklet linking between clips end-to-end learnable. Compared to VisTR, EfficientVIS requires 15x fewer training epochs while achieving state-of-the-art accuracy on the YouTube-VIS benchmark. Meanwhile, our method enables whole-video instance segmentation in a single end-to-end pass without any data association. |
DOI: | 10.48550/arxiv.2203.01853 |
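
The abstract describes tracklet queries that are refined by an iterative query-video interaction over a short clip. Below is a minimal, hypothetical sketch of that general idea using PyTorch cross-attention; the class name, query count, and iteration count are assumptions for illustration only, not the authors' EfficientVIS implementation.

```python
# Illustrative sketch only: names and hyperparameters are hypothetical,
# not taken from the EfficientVIS paper or its code.
import torch
import torch.nn as nn


class TrackletQueryDecoder(nn.Module):
    """Refine learned tracklet queries by repeatedly attending to the
    flattened space-time features of one video clip."""

    def __init__(self, dim=256, num_queries=10, num_iters=3):
        super().__init__()
        self.queries = nn.Embedding(num_queries, dim)        # learned tracklet queries
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.ReLU(), nn.Linear(4 * dim, dim))
        self.num_iters = num_iters

    def forward(self, clip_feats):
        # clip_feats: (B, T*H*W, dim) flattened space-time feature map of a clip
        B = clip_feats.shape[0]
        q = self.queries.weight.unsqueeze(0).expand(B, -1, -1)   # (B, N, dim)
        for _ in range(self.num_iters):                          # iterative query-video interaction
            attn_out, _ = self.cross_attn(q, clip_feats, clip_feats)
            q = q + attn_out
            q = q + self.ffn(q)
        return q  # refined queries; class/box/mask heads would follow in a full model


# Toy usage: a clip of T=4 frames with an 8x8 feature map and 256 channels
feats = torch.randn(1, 4 * 8 * 8, 256)
decoder = TrackletQueryDecoder()
print(decoder(feats).shape)  # torch.Size([1, 10, 256])
```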