Pre-training with Random Orthogonal Projection Image Modeling
Saved in:
Main authors: , , ,
Format: Article
Language: eng
Keywords:
Online access: Order full text
Abstract: Masked Image Modeling (MIM) is a powerful self-supervised strategy for visual pre-training without the use of labels. MIM applies random crops to input images, processes them with an encoder, and then recovers the masked inputs with a decoder, which encourages the network to capture and learn structural information about objects and scenes. The intermediate feature representations obtained from MIM are suitable for fine-tuning on downstream tasks. In this paper, we propose an image modeling framework based on random orthogonal projection instead of the binary masking used in MIM. Our proposed Random Orthogonal Projection Image Modeling (ROPIM) reduces spatial token information under a guaranteed bound on the noise variance, and can be viewed as masking the entire spatial image area under locally varying masking degrees. Since ROPIM uses a random subspace for the projection that realizes the masking step, the readily available complement of that subspace can be used during unmasking to promote recovery of the removed information. We show that random orthogonal projection leads to superior performance compared to crop-based masking, and we demonstrate state-of-the-art results on several popular benchmarks.
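
The contrast between binary masking and projection-based masking described in the abstract can be made concrete with a short sketch. The following NumPy snippet is a minimal illustration under assumed shapes and a QR-based subspace construction; it is not the authors' implementation, and the names (`P_keep`, `P_drop`) and dimensions (a 14x14 ViT token grid, a quarter-dimensional subspace) are hypothetical choices for illustration.

```python
# Minimal sketch: binary token masking (MIM) vs. random orthogonal
# projection masking (ROPIM-style). All shapes and hyperparameters here
# are assumptions, not values from the paper.
import numpy as np

rng = np.random.default_rng(0)

num_tokens, dim = 196, 768        # e.g. 14x14 patch tokens from a ViT (assumed)
tokens = rng.standard_normal((num_tokens, dim))

# --- Binary masking as in MIM: zero out a random subset of tokens entirely.
mask_ratio = 0.75
masked_idx = rng.choice(num_tokens, size=int(mask_ratio * num_tokens),
                        replace=False)
mim_tokens = tokens.copy()
mim_tokens[masked_idx] = 0.0      # masked tokens carry no information at all

# --- Projection masking: remove a random subspace of the feature space
# instead of whole tokens, so every spatial location is partially degraded.
k = dim // 4                       # subspace dimension (assumed hyperparameter)
A = rng.standard_normal((dim, k))
S, _ = np.linalg.qr(A)             # S: (dim, k) orthonormal basis of the subspace

P_keep = np.eye(dim) - S @ S.T     # projects onto the complement (kept signal)
P_drop = S @ S.T                   # projects onto the subspace (removed signal)

ropim_tokens = tokens @ P_keep     # "masked" input: subspace component removed

# The two projections are complementary, so the removed component is exactly
# the residual the decoder can be trained to recover during unmasking.
removed = tokens @ P_drop
assert np.allclose(ropim_tokens + removed, tokens)
```

Since `P_keep` and `P_drop` are complementary orthogonal projections (`P_keep + P_drop = I`), the component suppressed during "masking" is precisely what remains available as a recovery target during unmasking, which is the property the abstract attributes to the readily available complement of the random subspace.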
DOI: 10.48550/arxiv.2310.18737