Latent Variable Sequential Set Transformers For Joint Multi-Agent Motion Prediction
Format: Article
Language: English
Abstract: Robust multi-agent trajectory prediction is essential for the safe control of
robotic systems. A major challenge is to efficiently learn a representation
that approximates the true joint distribution of contextual, social, and
temporal information to enable planning. We propose Latent Variable Sequential
Set Transformers which are encoder-decoder architectures that generate
scene-consistent multi-agent trajectories. We refer to these architectures as
"AutoBots". The encoder is a stack of interleaved temporal and social
multi-head self-attention (MHSA) modules which alternately perform equivariant
processing across the temporal and social dimensions. The decoder employs
learnable seed parameters in combination with temporal and social MHSA modules,
allowing it to efficiently perform inference over the entire future scene in a
single forward pass. AutoBots can produce either the trajectory of one
ego-agent or a distribution over the future trajectories for all agents in the
scene. For the single-agent prediction case, our model achieves top results on
the global nuScenes vehicle motion prediction leaderboard, and produces strong
results on the Argoverse vehicle prediction challenge. In the multi-agent
setting, we evaluate on the synthetic partition of the TrajNet++ dataset to
showcase the model's socially-consistent predictions. We also demonstrate our
model on general sequences of sets and provide illustrative experiments
modelling the sequential structure of the multiple strokes that make up symbols
in the Omniglot data. A distinguishing feature of AutoBots is that all models
are trainable on a single desktop GPU (1080 Ti) in under 48h.
DOI: 10.48550/arxiv.2104.00563
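The abstract's description of the encoder, interleaved temporal and social multi-head self-attention over a scene of agents and timesteps, can be illustrated with a short sketch. The following is a minimal, hypothetical PyTorch rendition and not the authors' released code; the class name AutoBotEncoderLayer, the tensor layout (batch, agents, time, features), and the default sizes are assumptions made for illustration only.

```python
# Minimal sketch (assumed, not the authors' implementation) of one interleaved
# temporal/social multi-head self-attention block, as described in the abstract.
import torch
import torch.nn as nn


class AutoBotEncoderLayer(nn.Module):
    """One interleaved block: temporal MHSA per agent, then social MHSA per timestep."""

    def __init__(self, d_model: int = 128, n_heads: int = 8):
        super().__init__()
        self.temporal_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.social_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, agents, time, d_model)
        B, A, T, D = x.shape

        # Temporal attention: each agent attends over its own timesteps.
        xt = x.reshape(B * A, T, D)
        attn_t, _ = self.temporal_attn(xt, xt, xt)
        x = self.norm1(xt + attn_t).reshape(B, A, T, D)

        # Social attention: at each timestep, agents attend to one another.
        xs = x.permute(0, 2, 1, 3).reshape(B * T, A, D)
        attn_s, _ = self.social_attn(xs, xs, xs)
        x = self.norm2(xs + attn_s).reshape(B, T, A, D).permute(0, 2, 1, 3)
        return x


# Toy usage: batch of 2 scenes, 4 agents, 8 timesteps, 128-dim features.
x = torch.randn(2, 4, 8, 128)
print(AutoBotEncoderLayer()(x).shape)  # torch.Size([2, 4, 8, 128])
```

Stacking several such blocks alternates equivariant processing across the temporal and social dimensions, which is the pattern the abstract attributes to the AutoBots encoder; the decoder's learnable seed parameters are not shown here.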