Location-Aware Transformer Network for Bird's Eye View Semantic Segmentation
Published in: IEEE Transactions on Intelligent Vehicles, 2024-10, pp. 1-12
Main authors: , , , ,
Format: Article
Language: English
Keywords:
Online access: Order full text
Abstract: Bird's Eye View (BEV) segmentation with multiple surrounding cameras is crucial in autonomous driving due to its intuitive top-down view of the road environment. Despite the success of previous transformer-based networks, earlier works did not fully utilize the fact that each location on the BEV map requires features at its own resolution from the perspective view (PV). For instance, high-resolution features are needed for distant locations, while low-resolution features are sufficient for nearby areas. Therefore, a suitable combination of different resolution features is necessary to accurately handle variations in appearance in the PV, such as scale, lighting, context, and occlusion across various locations. To address this, we propose a new BEV segmentation network named Location-Aware Transformer Network (LATNet). LATNet blends different resolutions of PV image features using Location-Aware Attention, leveraging the correlation between the resolution of PV features and their location on the BEV map. By doing so, LATNet can create robust features for any BEV map location. Additionally, we introduce Ego-Centric Aware Flip (ECA Flip), a novel augmentation strategy for multi-camera-based BEV segmentation. Conventional augmentation methods disrupt the geometric relationship between PV and BEV, making them unsuitable for multi-camera-based BEV segmentation tasks. In contrast, ECA Flip can augment the original data up to fourfold without compromising the geometric relationships, significantly boosting the network's performance. Our approach achieves state-of-the-art results with real-time inference speed for camera-based semantic segmentation on both the nuScenes and Argoverse datasets.
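The two ideas in the abstract can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: the abstract gives no equations, so the softmax-over-scales weighting, all shapes, and both function names are assumptions. It shows (a) per-BEV-cell blending of multi-resolution PV features and (b) the four geometry-preserving flip variants behind ECA Flip's "up to fourfold" augmentation.

```python
import numpy as np

def location_aware_blend(pv_feats_on_bev, scale_logits):
    """Blend multi-resolution PV features with per-BEV-cell weights.

    Hypothetical sketch of Location-Aware Attention. Assumptions:
    - pv_feats_on_bev: list of S arrays, each (C, H, W), one per PV
      resolution, already projected onto the BEV grid;
    - scale_logits: (S, H, W) learnable logits, so distant cells can
      learn to favor high-resolution features and nearby cells
      low-resolution ones, as the abstract motivates.
    """
    stacked = np.stack(pv_feats_on_bev, axis=0)           # (S, C, H, W)
    e = np.exp(scale_logits - scale_logits.max(axis=0, keepdims=True))
    weights = e / e.sum(axis=0, keepdims=True)            # softmax over S
    return (stacked * weights[:, None]).sum(axis=0)       # (C, H, W)

def eca_flip_variants(bev_map):
    """Four variants of a (C, H, W) BEV map: identity, flip about the
    ego x-axis, the ego y-axis, and both. The matching PV-side camera
    remapping the paper would need is omitted in this sketch."""
    return [
        bev_map,
        bev_map[:, ::-1, :],
        bev_map[:, :, ::-1],
        bev_map[:, ::-1, ::-1],
    ]

feats = [np.random.rand(8, 50, 50) for _ in range(3)]
fused = location_aware_blend(feats, np.zeros((3, 50, 50)))
print(fused.shape, len(eca_flip_variants(fused)))  # (8, 50, 50) 4
```

With all-zero logits the blend reduces to an equal-weight average; training the logits would let each BEV cell pick its own resolution mix.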
ISSN: 2379-8858, 2379-8904
DOI: 10.1109/TIV.2024.3485122