Unleashing Generalization of End-to-End Autonomous Driving with Controllable Long Video Generation
Format: Article
Language: English
Online access: Order full text
Abstract: Using generative models to synthesize new data has become a de-facto standard in autonomous driving to address the data scarcity issue. Although existing approaches can boost perception models, we find that they fail to improve the planning performance of end-to-end autonomous driving models, because the generated videos are usually shorter than 8 frames and their spatial and temporal inconsistencies are not negligible. To this end, we propose Delphi, a novel diffusion-based long video generation method with a shared noise modeling mechanism across the multiple views to increase spatial consistency, and a feature-aligned module to achieve both precise controllability and temporal consistency. Our method can generate up to 40 frames of video without loss of consistency, about 5 times longer than state-of-the-art methods. Instead of generating new data at random, we further design a sampling policy that lets Delphi generate data similar to observed failure cases, improving sample efficiency. This is achieved by building a failure-case-driven framework with the help of pre-trained vision-language models. Extensive experiments demonstrate that Delphi generates long videos of higher quality than previous state-of-the-art methods. Consequently, by generating only 4% of the training dataset size, our framework goes beyond perception and prediction tasks and, for the first time to the best of our knowledge, boosts the planning performance of an end-to-end autonomous driving model by a margin of 25%.
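The shared-noise idea described in the abstract can be illustrated with a small sketch. This is a minimal illustration of the general principle (correlating the initial diffusion noise across camera views so that all views start the reverse process from consistent states), not Delphi's actual implementation; the function name, the mixing weight `alpha`, and the latent shape are assumptions made for the example.

```python
# Minimal sketch: shared-noise initialization across camera views for a
# multi-view diffusion sampler. Each view's starting latent blends one noise
# tensor shared by all views with an independent per-view tensor, so the
# views begin reverse diffusion from correlated states.
import torch


def init_multiview_noise(num_views, latent_shape, alpha=0.5, generator=None):
    """Return initial latents of shape (num_views, *latent_shape)."""
    shared = torch.randn(latent_shape, generator=generator)                # common to all views
    per_view = torch.randn(num_views, *latent_shape, generator=generator)  # view-specific
    mixed = alpha * shared.unsqueeze(0) + (1.0 - alpha) * per_view
    # Rescale so the blended noise keeps (approximately) unit variance,
    # as a standard DDPM/DDIM sampler expects.
    return mixed / (alpha ** 2 + (1.0 - alpha) ** 2) ** 0.5


# Example: six surround-view cameras, one 4x28x50 latent per view.
latents = init_multiview_noise(num_views=6, latent_shape=(4, 28, 50))
print(latents.shape)  # torch.Size([6, 4, 28, 50])
```

With `alpha = 0` this reduces to fully independent per-view noise; larger values increase cross-view correlation and, by the abstract's argument, spatial consistency.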
DOI: 10.48550/arxiv.2406.01349
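The failure-case-driven sampling policy can likewise be sketched as a simple loop. All callables below (`describe_scene`, `generate_video`) are hypothetical placeholders standing in for a pre-trained vision-language model and the video generator; they are not Delphi's real API.

```python
# Minimal sketch of a failure-case-driven generation loop: describe each
# recorded planner failure with a vision-language model, then reuse that
# description as the condition for the video generator, so new training data
# concentrates on scenarios the planner currently handles poorly.
from typing import Callable, List, Sequence


def targeted_generation(
    failure_frames: Sequence[object],
    describe_scene: Callable[[object], str],   # VLM: frame -> text description
    generate_video: Callable[[str], object],   # generator: prompt -> synthetic clip
    clips_per_failure: int = 3,
) -> List[object]:
    new_clips = []
    for frame in failure_frames:
        prompt = describe_scene(frame)         # e.g. "rainy night, occluded pedestrian ..."
        new_clips.extend(generate_video(prompt) for _ in range(clips_per_failure))
    return new_clips
```

The resulting clips would then be added to the training set and the planner retrained; according to the abstract, augmenting with only about 4% of the original dataset size in this targeted way is what yields the reported 25% planning improvement.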