Temporal DINO: A Self-supervised Video Strategy to Enhance Action Prediction
Format: | Article |
Language: | English |
Abstract: | The emerging field of action prediction plays a vital role in various computer vision applications such as autonomous driving, activity analysis, and human-computer interaction. Despite significant advancements, accurately predicting future actions remains a challenging problem due to the high dimensionality, complex dynamics, and uncertainties inherent in video data. Traditional supervised approaches require large amounts of labelled data, which is expensive and time-consuming to obtain. This paper introduces a novel self-supervised video strategy for enhancing action prediction, inspired by DINO (self-distillation with no labels). The Temporal-DINO approach employs two models: a 'student' that processes only past frames and a 'teacher' that processes both past and future frames, giving it a broader temporal context. During training, the teacher guides the student to learn future context while observing only past frames (a minimal code sketch of this setup follows the record below). The strategy is evaluated on the ROAD dataset for the action prediction downstream task using 3D-ResNet, Transformer, and LSTM architectures. The experimental results show significant improvements in prediction performance across these architectures, with our method achieving an average improvement of 9.9 Precision Points (PP), highlighting its effectiveness in enhancing the backbones' ability to capture long-term dependencies. Furthermore, our approach is efficient with respect to the pretraining dataset size and the number of epochs required. It also overcomes limitations of other approaches by supporting various backbone architectures, addressing multiple prediction horizons, reducing reliance on hand-crafted augmentations, and streamlining pretraining into a single stage. These findings highlight the potential of our approach for diverse video-based tasks such as activity recognition, motion planning, and scene understanding. |
DOI: | 10.48550/arxiv.2308.04589 |
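
The record reproduces only the paper's abstract, so the code below is a minimal, illustrative sketch (in PyTorch) of the kind of teacher-student temporal self-distillation the abstract describes, not the authors' released implementation. The `TemporalEncoder` toy backbone, the half/half past-future split, the temperatures, and the EMA momentum are assumptions made for illustration, and DINO details such as teacher-output centering and multi-crop augmentation are omitted.

```python
# Illustrative sketch of a Temporal-DINO-style training step (assumptions noted above).
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

class TemporalEncoder(nn.Module):
    """Toy stand-in for a video backbone (e.g. 3D-ResNet) plus projection head."""
    def __init__(self, in_ch=3, feat_dim=64, out_dim=128):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv3d(in_ch, feat_dim, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),   # pool over time and space -> (B, feat_dim, 1, 1, 1)
            nn.Flatten(),
        )
        self.head = nn.Linear(feat_dim, out_dim)

    def forward(self, clip):           # clip: (B, C, T, H, W)
        return self.head(self.backbone(clip))

student = TemporalEncoder()
teacher = copy.deepcopy(student)       # teacher starts as a copy of the student
for p in teacher.parameters():
    p.requires_grad_(False)            # teacher is never updated by backprop

opt = torch.optim.AdamW(student.parameters(), lr=1e-4)
tau_s, tau_t, ema = 0.1, 0.04, 0.996   # assumed temperatures and EMA momentum

def train_step(clip):
    """One self-distillation step: the student sees only past frames, the
    teacher sees past + future frames and provides the target distribution."""
    past = clip[:, :, : clip.shape[2] // 2]               # assumed 50/50 past-future split
    with torch.no_grad():
        teacher_probs = F.softmax(teacher(clip) / tau_t, dim=-1)
    student_logp = F.log_softmax(student(past) / tau_s, dim=-1)
    loss = -(teacher_probs * student_logp).sum(dim=-1).mean()  # cross-entropy distillation
    opt.zero_grad()
    loss.backward()
    opt.step()
    with torch.no_grad():                                  # EMA update of the teacher
        for pt, ps in zip(teacher.parameters(), student.parameters()):
            pt.mul_(ema).add_(ps, alpha=1 - ema)
    return loss.item()

# Example: one step on a random batch of 8 clips, 16 frames of 3x32x32 each.
print(train_step(torch.randn(8, 3, 16, 32, 32)))
```

The sketch keeps the structural points the abstract highlights: the student only ever observes past frames, the teacher additionally observes future frames and therefore carries the broader temporal context, and the teacher is updated as an exponential moving average of the student rather than by gradient descent, so pretraining remains a single stage.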