Is ImageNet worth 1 video? Learning strong image encoders from 1 long unlabelled video
Main authors: | , , , , |
---|---|
Format: | Article |
Language: | eng |
Subjects: | |
Online access: | Order full text |
Abstract: Self-supervised learning has unlocked the potential of scaling up pretraining
to billions of images, since annotation is unnecessary. But are we making the
best use of data? How much more economical can we be? In this work, we attempt to
answer this question by making two contributions. First, we investigate
first-person videos and introduce a "Walking Tours" dataset. These videos are
high-resolution, hours-long, captured in a single uninterrupted take, depicting
a large number of objects and actions with natural scene transitions. They are
unlabeled and uncurated, and thus realistic for self-supervision and comparable
with human learning.
Second, we introduce a novel self-supervised image pretraining method
tailored for learning from continuous videos. Existing methods typically adapt
image-based pretraining approaches to incorporate more frames. Instead, we
advocate a "tracking to learn to recognize" approach. Our method, called DoRA,
leads to attention maps that Discover and tRAck objects over time in an
end-to-end manner, using transformer cross-attention. We derive multiple views
from the tracks and use them in a classical self-supervised distillation loss.
Using our novel approach, a single Walking Tours video remarkably becomes a
strong competitor to ImageNet for several image and video downstream tasks.
DOI: 10.48550/arxiv.2310.08584
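
The abstract describes the method only at a high level: object prototypes are discovered and tracked across frames with transformer cross-attention, views are derived from the resulting tracks, and those views feed a classical self-supervised distillation loss. As a rough illustration of that pipeline, and not the authors' DoRA implementation, the following minimal PyTorch sketch shows how learned object tokens might attend to each frame's patch tokens to produce soft tracking masks, how masked pooling could yield one view per tracked object, and what a DINO-style distillation loss looks like. All names (`CrossAttentionTracker`, `masked_views`, `distillation_loss`) and hyperparameters here are hypothetical.

```python
# Illustrative sketch of the "tracking to learn to recognize" idea from the
# abstract -- NOT the authors' DoRA implementation. Names and hyperparameters
# are assumptions made for this example.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CrossAttentionTracker(nn.Module):
    """Track K object prototypes across frames via transformer cross-attention.

    Learned object tokens act as queries; each frame's patch tokens act as
    keys/values. The attention weights over patches serve as soft object
    masks, and the attended outputs become the queries for the next frame,
    so the prototypes are carried (tracked) through time.
    """

    def __init__(self, dim: int, num_objects: int = 4, num_heads: int = 4):
        super().__init__()
        self.obj_tokens = nn.Parameter(torch.randn(1, num_objects, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, patch_tokens: torch.Tensor) -> torch.Tensor:
        # patch_tokens: (B, T, N, D) -- T frames, N patches per frame
        B = patch_tokens.shape[0]
        queries = self.obj_tokens.expand(B, -1, -1)
        masks = []
        for t in range(patch_tokens.shape[1]):
            frame = patch_tokens[:, t]                 # (B, N, D)
            out, weights = self.attn(queries, frame, frame)
            masks.append(weights)                      # (B, K, N) soft masks
            queries = out                              # propagate prototypes
        return torch.stack(masks, dim=1)               # (B, T, K, N)


def masked_views(patch_tokens: torch.Tensor, masks: torch.Tensor) -> torch.Tensor:
    """Pool patch tokens under each soft mask: one view per tracked object."""
    w = masks / masks.sum(dim=-1, keepdim=True).clamp_min(1e-6)
    return torch.einsum("btkn,btnd->btkd", w, patch_tokens)


def distillation_loss(student: torch.Tensor, teacher: torch.Tensor,
                      tau_s: float = 0.1, tau_t: float = 0.04) -> torch.Tensor:
    """DINO-style loss: cross-entropy of student vs. sharpened teacher output."""
    t = F.softmax(teacher / tau_t, dim=-1).detach()
    s = F.log_softmax(student / tau_s, dim=-1)
    return -(t * s).sum(dim=-1).mean()


# Toy usage with random patch tokens standing in for a ViT encoder's output:
B, T, N, D = 2, 8, 196, 64
frames = torch.randn(B, T, N, D)
tracker = CrossAttentionTracker(D, num_objects=4)
views = masked_views(frames, tracker(frames))          # (B, T, 4, D)
```

In the full method, such per-object views would be passed through student and teacher encoders (the teacher maintained as an exponential moving average of the student, as in DINO) with the distillation loss applied between their outputs; those training details are omitted from this sketch.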