Unsupervised 3D Perception with 2D Vision-Language Distillation for Autonomous Driving
Format: Article
Language: English
Abstract: Closed-set 3D perception models trained on only a pre-defined set of object categories can be inadequate for safety-critical applications such as autonomous driving, where new object types can be encountered after deployment. In this paper, we present a multi-modal auto labeling pipeline capable of generating amodal 3D bounding boxes and tracklets for training models on open-set categories without 3D human labels. Our pipeline exploits motion cues inherent in point cloud sequences in combination with freely available 2D image-text pairs to identify and track all traffic participants. Compared to recent studies in this domain, which can only provide class-agnostic auto labels limited to moving objects, our method handles both static and moving objects in an unsupervised manner and is able to output open-vocabulary semantic labels thanks to the proposed vision-language knowledge distillation. Experiments on the Waymo Open Dataset show that our approach outperforms prior work by significant margins on various unsupervised 3D perception tasks.
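The open-vocabulary labeling step can be illustrated with a small sketch: given an image-space embedding per auto-labeled 3D box (e.g. pooled from a CLIP-style encoder over the box's 2D projection) and text embeddings for an arbitrary list of category prompts, each box is assigned the category with the highest cosine similarity. This is a minimal illustration under those assumptions, not the paper's actual pipeline; `assign_open_vocab_labels`, the pooling strategy, and the toy embeddings below are hypothetical.

```python
import numpy as np

def assign_open_vocab_labels(box_features, text_features, category_names):
    """Assign an open-vocabulary label to each auto-labeled 3D box.

    Hypothetical sketch: box_features are assumed to be image-derived
    embeddings (e.g. pooled from a CLIP-style encoder over each box's
    2D projection); text_features are embeddings of category prompts
    such as "a photo of a {category}".

    box_features:  (N, D) one embedding per 3D box
    text_features: (C, D) one embedding per category prompt
    """
    # L2-normalize so the dot product equals cosine similarity
    box = box_features / np.linalg.norm(box_features, axis=1, keepdims=True)
    txt = text_features / np.linalg.norm(text_features, axis=1, keepdims=True)

    sim = box @ txt.T                        # (N, C) similarity matrix
    idx = sim.argmax(axis=1)                 # best-matching category per box
    scores = sim[np.arange(len(idx)), idx]   # similarity of that match
    return [(category_names[i], float(s)) for i, s in zip(idx, scores)]

# Toy usage with random vectors standing in for real image/text embeddings
rng = np.random.default_rng(0)
boxes = rng.normal(size=(4, 512))
texts = rng.normal(size=(3, 512))
print(assign_open_vocab_labels(boxes, texts, ["vehicle", "pedestrian", "cyclist"]))
```

Because the category list is just a set of text prompts, new object types can be added at query time without retraining, which is what makes the labels open-vocabulary rather than closed-set.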
DOI: 10.48550/arxiv.2309.14491