Location-Aware Transformer Network for Bird's Eye View Semantic Segmentation


Detailed Description

Bibliographic Details
Published in: IEEE Transactions on Intelligent Vehicles, 2024-10, p. 1-12
Main Authors: Woo, Suhan; Park, Minseong; Lee, Youngjo; Lee, Seoungwon; Kim, Euntai
Format: Article
Language: English
Description
Abstract: Bird's Eye View (BEV) segmentation with multiple surrounding cameras is crucial in autonomous driving due to its intuitive top-down view of the road environment. Despite the success of previous transformer-based networks, earlier works did not fully utilize the fact that each location on the BEV map requires features at its own resolution from the perspective view (PV). For instance, high-resolution features are needed for distant locations, while low-resolution features are sufficient for nearby areas. Therefore, a suitable combination of different resolution features is necessary to accurately handle variations in appearance in the PV, such as scale, lighting, context, and occlusion across various locations. To address this, we propose a new BEV segmentation network named Location-Aware Transformer Network (LATNet). LATNet blends different resolutions of PV image features using Location-Aware Attention, leveraging the correlation between the resolution of PV features and their location on the BEV map. By doing so, LATNet can create robust features for any BEV map location. Additionally, we introduce Ego-Centric Aware Flip (ECA Flip), a novel augmentation strategy for multi-camera-based BEV segmentation. Conventional augmentation methods disrupt the geometric relationship between PV and BEV, making them unsuitable for multi-camera-based BEV segmentation tasks. In contrast, ECA Flip can augment the original data up to fourfold without compromising the geometric relationships, significantly boosting the network's performance. Our approach achieves state-of-the-art results with real-time inference speed for camera-based semantic segmentation on both the nuScenes and Argoverse datasets.
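The core intuition behind Location-Aware Attention, that distant BEV locations should draw on high-resolution PV features while nearby ones can use low-resolution features, can be illustrated with a minimal sketch. The code below is NOT the paper's implementation; the triangular distance-to-level weighting, the `max_dist` parameter, and all function names are illustrative assumptions used only to show the distance-dependent blending idea.

```python
import numpy as np

# Illustrative sketch (hypothetical, not LATNet's actual mechanism):
# blend a PV feature pyramid so that distant BEV cells lean on
# high-resolution features and nearby cells on low-resolution ones.

def distance_weights(dist, num_levels, max_dist=50.0):
    """Soft assignment of a BEV cell at `dist` meters from the ego
    vehicle to pyramid levels 0..num_levels-1 (0 = lowest resolution)."""
    # Map distance to a fractional level index: far -> high level.
    t = np.clip(dist / max_dist, 0.0, 1.0) * (num_levels - 1)
    levels = np.arange(num_levels)
    # Triangular weights centered on the fractional level (assumed scheme).
    w = np.clip(1.0 - np.abs(levels - t), 0.0, None)
    return w / w.sum()

def blend_features(pyramid_feats, dist):
    """pyramid_feats: list of per-level feature vectors (same dim),
    already sampled at the PV location projecting to this BEV cell."""
    w = distance_weights(dist, len(pyramid_feats))
    return sum(wi * f for wi, f in zip(w, pyramid_feats))

# Three pyramid levels with distinguishable constant features.
feats = [np.full(4, v) for v in (1.0, 2.0, 3.0)]
near = blend_features(feats, 2.0)   # dominated by level 0 (low-res)
far = blend_features(feats, 50.0)   # dominated by level 2 (high-res)
```

In the actual network this blending is learned via attention rather than a fixed distance schedule; the sketch only makes the resolution-location correlation from the abstract concrete.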
ISSN:2379-8858
2379-8904
DOI:10.1109/TIV.2024.3485122