Multi-camera Bird's Eye View Perception for Autonomous Driving
Format: Article
Language: English
Abstract: Most automated driving systems comprise a diverse sensor set, including several cameras, Radars, and LiDARs, ensuring complete 360° coverage in near and far regions. Unlike Radar and LiDAR, which measure directly in 3D, cameras capture a 2D perspective projection with inherent depth ambiguity. However, it is essential to produce perception outputs in 3D to enable spatial reasoning about other agents and structures for optimal path planning. The 3D space is typically simplified to the bird's eye view (BEV) space by omitting the less relevant Z-coordinate, which corresponds to the height dimension. The most basic approach to obtaining the desired BEV representation from a camera image is inverse perspective mapping (IPM), which assumes a flat ground surface. Surround-view systems, common in new vehicles, use the IPM principle to generate a BEV image and show it to the driver on a display. However, this approach is not suited to autonomous driving, since the overly simplistic flat-world assumption introduces severe distortions for anything above the ground plane. More recent approaches use deep neural networks that output directly in BEV space, transforming camera images into BEV using geometric constraints applied implicitly or explicitly within the network. Because a CNN can exploit richer context and a learned transformation is more flexible and can adapt to the image content, deep-learning-based methods set the new benchmark for BEV transformation and achieve state-of-the-art performance. This chapter first discusses contemporary trends in multi-camera deep neural network (DNN) models that output object representations directly in BEV space. We then discuss how this approach extends to effective sensor fusion and to coupling with downstream tasks such as situation analysis and prediction. Finally, we outline challenges and open problems in BEV perception.
DOI: 10.48550/arxiv.2309.09080
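
The flat-ground IPM baseline mentioned in the abstract reduces the camera-to-BEV transformation to a single planar homography: for a calibrated camera, points on the ground plane Z = 0 map to image pixels via H ≃ K[r1 r2 t], so the homography can be recovered either from calibration or from a few ground-point correspondences. Below is a minimal sketch in Python/OpenCV; the image path, point correspondences, and BEV resolution are illustrative assumptions, not values from the paper.

```python
import numpy as np
import cv2

# Four ground-plane points in the perspective image (pixels) and their desired
# positions in the top-down BEV image (pixels). These coordinates are
# illustrative placeholders; in practice they come from camera calibration or
# from marking known points on the road surface.
img_pts = np.float32([[420, 560], [860, 560], [1100, 700], [180, 700]])
bev_pts = np.float32([[100,   0], [300,   0], [300,  600], [100, 600]])

# Homography mapping the (assumed flat) ground plane into the BEV image.
H = cv2.getPerspectiveTransform(img_pts, bev_pts)

img = cv2.imread("front_camera.png")           # hypothetical input frame
bev = cv2.warpPerspective(img, H, (400, 600))  # 400 x 600 px BEV image

# Anything above the ground plane (vehicles, poles, pedestrians) violates the
# flat-world assumption and appears stretched in the BEV image, which is the
# distortion that makes plain IPM unsuitable for autonomous driving.
```

The learned BEV methods surveyed in the chapter replace this fixed geometric warp with a transformation estimated by the network, which is why they cope far better with non-flat scenes and elevated objects.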