Exploring Recurrent Long-Term Temporal Fusion for Multi-View 3D Perception

Bibliographic Details
Published in: IEEE Robotics and Automation Letters, 2024-07, Vol. 9 (7), pp. 6544-6551
Authors: Han, Chunrui; Yang, Jinrong; Sun, Jianjian; Ge, Zheng; Dong, Runpei; Zhou, Hongyu; Mao, Weixin; Peng, Yuang; Zhang, Xiangyu
Format: Article
Language: English
Description
Abstract: Long-term temporal fusion is a crucial but often overlooked technique in camera-based Bird's-Eye-View (BEV) 3D perception. Most existing methods fuse temporal frames in a parallel manner. While parallel fusion can benefit from long-term information, it suffers from increasing computational and memory overheads as the fusion window size grows. Alternatively, BEVFormer adopts a recurrent fusion pipeline so that history information can be efficiently integrated, yet it fails to benefit from longer temporal frames. In this letter, we explore an embarrassingly simple long-term recurrent fusion strategy built upon LSS-based methods and find that it already enjoys the merits of both sides, i.e., rich long-term information and an efficient fusion pipeline. A temporal embedding module is further proposed to improve the model's robustness against occasionally missed frames in practical scenarios. We name this simple but effective fusion pipeline VideoBEV. Experimental results on the nuScenes benchmark show that VideoBEV obtains strong performance on various camera-based 3D perception tasks, including object detection (55.4% mAP and 62.9% NDS), segmentation (48.6% vehicle mIoU), tracking (54.8% AMOTA), and motion prediction (0.80 m minADE and 0.463 EPA).
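
The abstract describes the recurrent fusion only at a high level. The following minimal PyTorch sketch illustrates the general idea: a single fused history BEV feature is carried across frames, so memory and compute stay constant no matter how many past frames have been integrated, unlike parallel fusion, whose cost grows with the window size. This is an illustrative assumption, not the authors' implementation: the class RecurrentBEVFusion, its fusion convolution, and the time-gap gating standing in for the temporal embedding module are all hypothetical design choices.

import torch
import torch.nn as nn

class RecurrentBEVFusion(nn.Module):
    """Hypothetical sketch of recurrent long-term BEV temporal fusion."""

    def __init__(self, channels: int = 80):
        super().__init__()
        # Fuse the (already ego-motion-warped) history BEV feature with the
        # current frame's BEV feature produced by an LSS-style view transform.
        self.fuse = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
        )
        # Stand-in for the temporal embedding module: condition the history
        # feature on the time gap to the current frame, so an occasionally
        # missed frame (an unusually large gap) can be down-weighted.
        self.time_embed = nn.Sequential(
            nn.Linear(1, channels),
            nn.ReLU(inplace=True),
            nn.Linear(channels, channels),
        )

    def forward(self, curr_bev, hist_bev, dt):
        # curr_bev, hist_bev: (B, C, H, W) BEV features in the current ego
        # frame; dt: (B, 1) time gap in seconds since the last fused frame.
        if hist_bev is None:
            return curr_bev  # first frame: no history to fuse yet
        gate = torch.sigmoid(self.time_embed(dt))[..., None, None]  # (B, C, 1, 1)
        return self.fuse(torch.cat([curr_bev, hist_bev * gate], dim=1))

if __name__ == "__main__":
    fusion = RecurrentBEVFusion(channels=80)
    hist = None
    for _ in range(5):                        # dummy 5-frame clip
        curr = torch.randn(1, 80, 128, 128)   # stand-in for an LSS BEV feature
        dt = torch.full((1, 1), 0.5)          # 0.5 s between key frames
        hist = fusion(curr, hist, dt)         # recurrent state: constant size
    print(hist.shape)                         # torch.Size([1, 80, 128, 128])

Because only the single fused feature is kept as state, extending the temporal horizon costs nothing extra at inference time, which is the efficiency argument the abstract makes against parallel fusion.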
ISSN: 2377-3766
DOI: 10.1109/LRA.2024.3401172