Decoupled DETR: Spatially Disentangling Localization and Classification for Improved End-to-End Object Detection
Main authors: | , , , |
---|---|
Format: | Article |
Language: | eng |
Subjects: | |
Online access: | Order full text |
Summary: | The introduction of DETR represents a new paradigm for object detection.
However, its decoder conducts classification and box localization using shared
queries and cross-attention layers, leading to suboptimal results. We observe
that different regions of interest in the visual feature map are suitable for
performing query classification and box localization tasks, even for the same
object. Salient regions provide vital information for classification, while the
boundaries around them are more favorable for box regression. Unfortunately,
such spatial misalignment between these two tasks greatly hinders DETR's
training. Therefore, in this work, we focus on decoupling localization and
classification tasks in DETR. To achieve this, we introduce a new design scheme
called spatially decoupled DETR (SD-DETR), which includes a task-aware query
generation module and a disentangled feature learning process. We carefully
design the task-aware query initialization process and split the
cross-attention block in the decoder so that the task-aware queries can match
different visual regions. We also observe a prediction misalignment between
high classification confidence and precise localization, so we propose an
alignment loss to further guide the training of spatially decoupled DETR.
Through extensive experiments, we demonstrate that our approach achieves a
significant improvement on the MSCOCO dataset compared to previous work. For
instance, we improve the performance of
Conditional DETR by 4.5 AP. By spatially disentangling the two tasks, our
method overcomes the misalignment problem and greatly improves the performance
of DETR for object detection. |
---|---|
DOI: | 10.48550/arxiv.2310.15955 |