Deformable Part Region Learning and Feature Aggregation Tree Representation for Object Detection

Detailed description

Bibliographic details
Published in: IEEE transactions on pattern analysis and machine intelligence 2023-09, Vol.45 (9), p.1-18
Main author: Bae, Seung-Hwan
Format: Article
Language: English
Description
Abstract: Region-based object detection infers object regions for one or more categories in an image. Owing to recent advances in deep learning and region proposal methods, object detectors based on convolutional neural networks (CNNs) have flourished and provided promising detection results. However, the accuracy of convolutional object detectors is often degraded by the low feature discriminability caused by geometric variation or transformation of an object. In this paper, we propose deformable part region (DPR) learning, which allows decomposed part regions to deform according to the geometric transformation of an object. Because the ground truth of the part models is not available in many cases, we design part model losses for detection and segmentation, and learn the geometric parameters by minimizing an integral loss that includes those part losses. As a result, we can train our DPR network without extra supervision and make multi-part models deformable according to object geometric variation. Moreover, we propose a novel feature aggregation tree (FAT) to learn more discriminative region of interest (RoI) features via bottom-up tree construction. The FAT can learn stronger semantic features by aggregating part RoI features along the bottom-up pathways of the tree. We also present a spatial and channel attention mechanism for the aggregation between different node features. Based on the proposed DPR and FAT networks, we design a new cascade architecture that can refine detection tasks iteratively. Without bells and whistles, we achieve impressive detection and segmentation results on the MSCOCO and PASCAL VOC datasets. Our Cascade D-PRD achieves 57.9 box AP with the Swin-L backbone. We also provide an extensive ablation study to demonstrate the effectiveness and usefulness of the proposed methods for large-scale object detection.
ISSN: 0162-8828, 1939-3539, 2160-9292
DOI: 10.1109/TPAMI.2023.3268864
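
Illustrative sketch (not from the article): the abstract describes aggregating part RoI features along the bottom-up pathways of a tree, with attention applied when node features are combined. The toy Python example below shows only the general shape of such a bottom-up reduction. Every name in it (attend_and_merge, aggregate_bottom_up) is hypothetical, the features are plain equal-length lists of floats rather than RoI feature maps, and the per-channel softmax weighting is a simple stand-in for the paper's spatial and channel attention, which the abstract does not specify.

```python
import math


def attend_and_merge(left, right):
    """Merge two equal-length feature vectors channel by channel.

    Each output channel is a softmax-weighted average of the two inputs,
    so the larger activation receives the larger weight. This is a
    stand-in for an attention module, not the mechanism from the paper.
    """
    merged = []
    for a, b in zip(left, right):
        m = max(a, b)  # subtract the max before exp for numerical stability
        ea, eb = math.exp(a - m), math.exp(b - m)
        merged.append((ea * a + eb * b) / (ea + eb))
    return merged


def aggregate_bottom_up(leaves):
    """Reduce leaf features to a single root feature, level by level.

    Neighbouring nodes are paired at each level; when a level has an odd
    number of nodes, the last node is carried up unchanged.
    """
    if not leaves:
        raise ValueError("need at least one leaf feature")
    level = [list(f) for f in leaves]
    while len(level) > 1:
        parents = []
        for i in range(0, len(level) - 1, 2):
            parents.append(attend_and_merge(level[i], level[i + 1]))
        if len(level) % 2 == 1:
            parents.append(level[-1])
        level = parents
    return level[0]


# Four hypothetical part features with two channels each.
parts = [[1.0, 0.0], [1.0, 2.0], [3.0, 2.0], [3.0, 0.0]]
root = aggregate_bottom_up(parts)
```

With four leaves, the reduction builds a two-level binary tree: two parent nodes are formed first, and the root is then formed from those parents. A learned model would replace the fixed softmax weighting with trainable attention modules.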