Aligned Unsupervised Pretraining of Object Detectors with Self-training
Format: Article
Language: English
Abstract: The unsupervised pretraining of object detectors has recently become a key component of object detector training, as it leads to improved performance and faster convergence during the supervised fine-tuning stage. Existing unsupervised pretraining methods, however, typically rely on low-level information to define the proposals used to train the detector. Furthermore, in the absence of class labels for these proposals, an auxiliary loss is needed to add high-level semantics. This results in complex pipelines and a task gap between pretraining and the downstream task. We propose a framework that mitigates this issue and consists of three simple yet key ingredients: (i) richer initial proposals that encode high-level semantics, (ii) class pseudo-labeling through clustering, which enables pretraining with a standard object detection training pipeline, and (iii) self-training to iteratively improve and enrich the object proposals. Once the pretraining and downstream tasks are aligned, a simple detection pipeline without further bells and whistles can be used directly for pretraining and, in fact, achieves state-of-the-art performance by significant margins in both the full and low data regimes, across detector architectures and datasets. We further show that our pretraining strategy can also pretrain detectors from scratch (including the backbone) and works on complex images such as COCO, paving the way for unsupervised representation learning using object detection directly as a pretext task.
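
Below is a minimal, illustrative sketch of ingredient (ii), class pseudo-labeling through clustering: proposal features are grouped with k-means and each proposal receives its cluster id as a class pseudo-label, so a standard detection training pipeline can consume the resulting (box, pseudo-label) pairs. The feature source, the number of pseudo-classes, and all names below are assumptions for illustration only, not the authors' implementation; in self-training (ingredient (iii)), the same step would be repeated on the detector's own confident predictions.

```python
# Illustrative sketch only: cluster proposal features into pseudo-classes so an
# off-the-shelf detector trainer can be run on unlabeled data. All names and
# hyperparameters here (e.g. 128 pseudo-classes) are hypothetical.
import numpy as np
from sklearn.cluster import KMeans


def pseudo_label_proposals(proposal_features: np.ndarray,
                           num_pseudo_classes: int = 128,
                           seed: int = 0) -> np.ndarray:
    """Assign each proposal the id of its k-means cluster as a class pseudo-label."""
    kmeans = KMeans(n_clusters=num_pseudo_classes, n_init=10, random_state=seed)
    return kmeans.fit_predict(proposal_features)


if __name__ == "__main__":
    # Stand-in for pooled features of 10,000 object proposals (e.g. from a
    # self-supervised backbone); replace with real proposal embeddings.
    feats = np.random.randn(10_000, 256).astype(np.float32)
    labels = pseudo_label_proposals(feats, num_pseudo_classes=128)
    # The resulting (proposal box, pseudo-label) pairs can be fed to a standard
    # detection training pipeline; a self-training round would re-extract
    # proposals from the trained detector's confident predictions and repeat.
    print(labels.shape, labels.min(), labels.max())
```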
DOI: 10.48550/arxiv.2307.15697