Prompt-guided DETR with RoI-pruned masked attention for open-vocabulary object detection

Bibliographic details
Published in:	Pattern Recognition, 2024-11, Vol. 155, p. 110648, Article 110648
Main authors:	Song, Hwanjun; Bang, Jihwan
Format:	Article
Language:	English
Subjects:	Object detection; Open-vocabulary detection; OVD; Transformer

Description
Summary:	Prompt-OVD is an efficient and effective DETR-based framework for open-vocabulary object detection that utilizes class embeddings from CLIP as prompts, guiding the Transformer decoder to detect objects in base and novel classes. Additionally, our RoI-pruned masked attention helps leverage the zero-shot classification ability of the Vision Transformer-based CLIP, resulting in improved detection performance at a minimal computational cost. Our experiments on the OV-COCO and OV-LVIS datasets demonstrate that Prompt-OVD achieves an impressive 21.2 times faster inference speed than the first end-to-end open-vocabulary detection method (OV-DETR), while also achieving higher APs than four two-stage methods operating within similar inference time ranges. We release the code at https://github.com/DISL-Lab/Prompt-OVD.
• A prompt-guided decoding is proposed to keep a constant number of object queries.
• The decoding reduces the computational overhead of the Transformer decoder.
• RoI-pruned masked attention benefits from a pre-trained CLIP at a minimal cost.
• We improve the efficiency and accuracy of the end-to-end DETR-based OVD method.
ISSN:	0031-3203, 1873-5142
DOI:	10.1016/j.patcog.2024.110648