Future pedestrian location prediction in first-person videos for autonomous vehicles and social robots

Bibliographic details
Published in: Image and Vision Computing, 2023-06, Vol. 134, p. 104671, Article 104671
Main authors: Chen, Kai; Zhu, Haihua; Tang, Dunbing; Zheng, Kun
Format: Article
Language: English
Description
Abstract:
• The depth information of the image is used to map the 2D image into 3D space.
• Pedestrian poses and spatial interactions are fused into a multi-channel tensor.
• A convolutional transformer is constructed to improve prediction precision.
• The proposed model establishes a global time dependency between input and output.
• The proposed model can also predict pedestrians' depth, i.e. their distance from the camera.

Predicting future pedestrian trajectories in first-person videos can help autonomous vehicles and social robots achieve better human-vehicle interaction. Given an egocentric video stream, we aim to predict the location and depth (the distance between the observed person and the camera) of neighboring pedestrians in future frames. To predict their future trajectories, we consider three main factors: a) The spatial distribution of pedestrians in the 2D image must be restored to 3D space, i.e., the distance between each pedestrian and the camera, which is often neglected, must be extracted. b) Neighbors' poses are critical for recognizing their intentions. c) Human-vehicle interactions should be learned from the pedestrians' historical trajectories. We incorporate these three factors into a multi-channel tensor that represents the main features in real-life 3D space. We then feed this tensor into an innovative end-to-end fully convolutional network based on the transformer architecture. Experimental results show that our method outperforms other state-of-the-art methods on the public benchmarks MOT15, MOT16 and MOT17. The proposed method is expected to be useful for understanding human-vehicle interaction and for pedestrian collision avoidance.
ISSN: 0262-8856, 1872-8138
DOI: 10.1016/j.imavis.2023.104671
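
Factor a) of the abstract, restoring pedestrian positions from the 2D image to 3D space using depth, can be illustrated with a minimal sketch. The record does not say how the authors perform this mapping; the pinhole camera model, the intrinsics `fx`, `fy`, `cx`, `cy`, and the function names `back_project` and `build_location_channels` are all assumptions made for illustration. Pose and interaction channels, which the abstract also mentions, are not modeled here.

```python
from typing import List, Sequence, Tuple


def back_project(u: float, v: float, depth: float,
                 fx: float, fy: float, cx: float, cy: float
                 ) -> Tuple[float, float, float]:
    """Map pixel (u, v) with a known depth to camera-frame (X, Y, Z).

    Assumes an ideal pinhole camera (hypothetical; not stated in the source).
    """
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)


def build_location_channels(track: Sequence[Tuple[float, float, float]],
                            fx: float, fy: float, cx: float, cy: float
                            ) -> List[List[float]]:
    """Turn one pedestrian's (u, v, depth) history into 3 channels x T frames.

    Channel 0 holds X, channel 1 holds Y, channel 2 holds Z over time.
    """
    points = [back_project(u, v, d, fx, fy, cx, cy) for (u, v, d) in track]
    return [[p[c] for p in points] for c in range(3)]


# Hypothetical observations: pixel position and depth in metres over 3 frames.
history = [(320.0, 240.0, 5.0), (420.0, 240.0, 5.0), (420.0, 340.0, 4.0)]
channels = build_location_channels(history, fx=500.0, fy=500.0, cx=320.0, cy=240.0)
```

A layout of channels by time steps is one plausible starting point for the multi-channel tensor the abstract describes; further channels for pose keypoints or neighbor interactions would be stacked along the first axis.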