1st Place Solution for the 5th LSVOS Challenge: Video Instance Segmentation
Format: Article
Language: English
Abstract: Video instance segmentation is a challenging task that serves as the cornerstone of numerous downstream applications, including video editing and autonomous driving. In this report, we present further improvements to the SOTA VIS method, DVIS. First, we introduce a denoising training strategy for the trainable tracker, allowing it to achieve more stable and accurate object tracking in complex and long videos. Additionally, we explore the role of visual foundation models in video instance segmentation. By utilizing a frozen ViT-L model pre-trained by DINOv2, DVIS demonstrates remarkable performance improvements. With these enhancements, our method achieves 57.9 AP and 56.0 AP in the development and test phases, respectively, and ultimately ranked 1st in the VIS track of the 5th LSVOS Challenge. The code will be available at https://github.com/zhang-tao-whu/DVIS.
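The abstract mentions using a frozen ViT-L backbone pre-trained with DINOv2. As a rough illustration only (not the authors' DVIS implementation), the sketch below loads the public DINOv2 ViT-L/14 checkpoint via torch.hub, keeps it frozen, and extracts per-frame patch features; the output-dictionary keys follow the public facebookresearch/dinov2 repository, and the downstream head mentioned in the comments is hypothetical.

```python
# Hedged sketch: per-frame feature extraction with a frozen DINOv2 ViT-L/14 backbone.
# Illustrative only; not the DVIS training code.
import torch

# Load the published ViT-L/14 checkpoint (downloads weights on first call).
backbone = torch.hub.load("facebookresearch/dinov2", "dinov2_vitl14")
backbone.eval()
for p in backbone.parameters():
    p.requires_grad_(False)  # keep the foundation model frozen; only the VIS head would train

# Toy clip: 2 frames of 3x224x224 (spatial sides must be multiples of the 14-pixel patch size).
frames = torch.randn(2, 3, 224, 224)

with torch.no_grad():
    feats = backbone.forward_features(frames)
    patch_tokens = feats["x_norm_patchtokens"]  # shape (2, 256, 1024): 16x16 patches, dim 1024

# A trainable segmenter and tracker (e.g. a Mask2Former-style head, as used by DVIS) would
# consume these per-frame features; gradients flow only through those trainable modules.
print(patch_tokens.shape)
```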
DOI: 10.48550/arxiv.2308.14392