DINO-MOT: 3D Multi-Object Tracking With Visual Foundation Model for Pedestrian Re-Identification Using Visual Memory Mechanism

Bibliographic Details
Published in: IEEE Robotics and Automation Letters, 2025-02, Vol. 10 (2), p. 1202-1208
Main authors: Lee, Min Young; Lee, Christina Dao Wen; Jianghao, Li; Ang, Marcelo H.
Format: Article
Language: English
Description
Summary: In the advancing domain of autonomous driving, this research focuses on enhancing 3D Multi-Object Tracking (3D-MOT). Pedestrians are particularly vulnerable in urban environments, and robust tracking methodologies are required to understand their movements. Prevalent Tracking-By-Detection (TBD) frameworks often underutilize the rich visual data from sensors such as cameras. This study leverages the advanced visual foundation model DINOv2 to refine the TBD framework by incorporating the camera modality, thereby improving pedestrian tracking consistency and overall 3D-MOT performance. The proposed DINO-MOT framework is the first application of DINOv2 to enhancing 3D-MOT through pedestrian Re-Identification (Re-ID), and a Score Filter Ceiling is implemented to prevent premature exclusion of low-confidence 3D detections during tracking association. Using DINOv2 as a feature extractor within the DINO-MOT framework reduces pedestrian ID switches by up to 12.3%. Achieving an AMOTA of 76.3% on the nuScenes test dataset, DINO-MOT sets a new benchmark in the 3D-MOT literature with an improvement of 0.5%, securing the top rank on the leaderboard. Furthermore, this research demonstrates the potential of applying a visual foundation model to improve the existing TBD framework and enhance 3D-MOT in autonomous driving.
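
The summary above does not detail the method, so the following is only a minimal illustrative sketch of how a frozen DINOv2 backbone can serve as a pedestrian Re-ID appearance-feature extractor inside a tracking-by-detection pipeline. The function names (embed_crops, reid_affinity), the 224x224 crop resolution, and the cosine-similarity affinity are assumptions made for illustration; only the use of DINOv2 as a feature extractor is taken from the abstract, and this is not the authors' published implementation.

    # Illustrative sketch only: a frozen DINOv2 backbone used as an appearance
    # feature extractor for pedestrian Re-ID in a tracking-by-detection setting.
    # Crop size, similarity measure, and function names are assumptions, not
    # taken from the DINO-MOT paper.
    import torch
    import torch.nn.functional as F
    from torchvision import transforms

    # Load a small DINOv2 backbone from torch.hub (weights download on first use).
    dinov2 = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14")
    dinov2.eval()

    # DINOv2 expects ImageNet-normalized inputs with side lengths divisible by 14.
    preprocess = transforms.Compose([
        transforms.Resize((224, 224)),
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406],
                             std=[0.229, 0.224, 0.225]),
    ])

    @torch.no_grad()
    def embed_crops(crops):
        """Return L2-normalized DINOv2 embeddings for a list of PIL pedestrian crops."""
        batch = torch.stack([preprocess(c) for c in crops])
        feats = dinov2(batch)            # CLS-token features, shape (N, 384) for ViT-S/14
        return F.normalize(feats, dim=-1)

    def reid_affinity(track_embs, det_embs):
        """Cosine-similarity matrix between stored track embeddings and new detections."""
        return track_embs @ det_embs.T   # (num_tracks, num_dets), values in [-1, 1]

In a TBD tracker, such an appearance affinity would typically be combined with a motion-based cost (e.g., 3D IoU or Mahalanobis distance) before assignment; the abstract does not specify how DINO-MOT fuses these cues or how its visual memory mechanism stores track embeddings.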
ISSN: 2377-3766
DOI: 10.1109/LRA.2024.3500882