DiffusionVID: Denoising Object Boxes With Spatio-Temporal Conditioning for Video Object Detection

Full Description

Saved in:
Bibliographic Details
Published in: IEEE Access, 2023, Vol. 11, pp. 121434-121444
Main Authors: Roh, Si-Dong; Chung, Ki-Seok
Format: Article
Language: English
Subjects:
Online Access: Full text
Description
Summary: Several existing still image object detectors suffer from image deterioration in videos, such as motion blur, camera defocus, and partial occlusion. We present DiffusionVID, a diffusion model-based video object detector that exploits spatio-temporal conditioning. Inspired by the diffusion model, DiffusionVID refines random noise boxes to obtain the original object boxes in a video sequence. To effectively refine the object boxes from the degraded images in the videos, we used three novel approaches: cascade refinement, dynamic coreset conditioning, and local batch refinement. The cascade refinement architecture progressively extracts information and refines boxes, whereas the dynamic coreset conditioning further improves the denoising quality using adaptive conditions based on the spatio-temporal coreset. Local batch refinement significantly improves the inference speed by exploiting GPU parallelism. On the standard and widely used ImageNet-VID benchmark, our DiffusionVID with the ResNet-101 and Swin-Base backbones achieves 86.9 mAP @ 46.6 FPS and 92.4 mAP @ 27.0 FPS, respectively, which is state-of-the-art performance. To the best of the authors' knowledge, this is the first video object detector based on a diffusion model. The code and models are available at https://github.com/sdroh1027/DiffusionVID.
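To give a flavor of the box-denoising idea the summary describes, here is a minimal, hypothetical sketch (not the authors' implementation): random noisy boxes are passed through a few cascaded refinement stages, each predicting an offset toward the clean boxes. The function name `refine_boxes` and the stand-in regressor `predict_fn` are illustrative assumptions, not part of the DiffusionVID codebase.

```python
import numpy as np

def refine_boxes(noisy_boxes, predict_fn, num_stages=3):
    """Cascaded box refinement sketch (illustrative, not the paper's code).

    Starting from noisy boxes, each stage applies a predicted correction,
    mimicking the diffusion-style denoising of object boxes described in
    the summary. `predict_fn` stands in for a learned per-stage regressor.
    """
    boxes = noisy_boxes
    for _ in range(num_stages):
        boxes = boxes + predict_fn(boxes)  # predicted offset toward clean boxes
    return boxes

# Toy usage: the stand-in "model" nudges boxes halfway toward a fixed target,
# so three stages shrink the initial error by a factor of (1/2)^3.
target = np.array([[10.0, 10.0, 50.0, 50.0]])  # clean box (x1, y1, x2, y2)
noisy = np.array([[0.0, 0.0, 100.0, 100.0]])   # random initial box
refined = refine_boxes(noisy, lambda b: 0.5 * (target - b))
```

In the toy run, `refined` ends up one eighth of the original distance from the target, illustrating how cascaded stages progressively denoise the boxes.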
ISSN: 2169-3536
DOI: 10.1109/ACCESS.2023.3328341