Morphing Tokens Draw Strong Masked Image Models
Saved in:
Main authors: | , , |
Format: | Article |
Language: | English |
Subjects: | |
Online access: | Order full text |
Abstract: | Masked image modeling (MIM) has emerged as a promising approach for
training Vision Transformers (ViTs). The essence of MIM lies in the token-wise
prediction of masked tokens, whose targets are tokenized from images or
generated by pre-trained models such as vision-language models. While
tokenizers and pre-trained models are plausible sources of MIM targets, they
often produce spatially inconsistent targets even for neighboring tokens,
making it harder for models to learn unified and discriminative
representations. Our pilot study identifies such spatial inconsistencies and
suggests that resolving them can accelerate representation learning. Building
on this insight, we introduce a novel self-supervision signal called Dynamic
Token Morphing (DTM), which dynamically aggregates contextually related tokens
to yield contextualized targets, thereby mitigating spatial inconsistency. DTM
is compatible with various SSL frameworks; we showcase improved MIM results by
employing DTM while introducing barely any extra training cost. Our method
facilitates training with consistent targets, resulting in 1) faster training
and 2) reduced losses. Experiments on ImageNet-1K and ADE20K demonstrate the
superiority of our method compared with state-of-the-art, complex MIM methods.
Furthermore, comparative evaluations on iNaturalist and fine-grained visual
classification datasets further validate the transferability of our method
across various downstream tasks. Code is available at
https://github.com/naver-ai/dtm |
DOI: | 10.48550/arxiv.2401.00254 |
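
The abstract describes DTM only at a high level, so the following is a minimal, hypothetical sketch of what "dynamically aggregating contextually related tokens" into smoothed targets might look like, assuming a similarity-weighted top-k aggregation rule in PyTorch. The function name `morph_targets` and the `temperature`/`top_k` parameters are illustrative assumptions, not the paper's actual API; the real implementation is in the linked repository.

```python
# Illustrative sketch only: the abstract does not specify DTM's exact
# aggregation rule, so this uses similarity-weighted top-k token averaging
# as a hypothetical stand-in for "dynamically aggregating contextually
# related tokens" into spatially consistent targets.
import torch
import torch.nn.functional as F


def morph_targets(teacher_tokens: torch.Tensor,
                  temperature: float = 0.1,
                  top_k: int = 8) -> torch.Tensor:
    """Aggregate each token with its most similar tokens to form
    contextualized MIM targets.

    teacher_tokens: (B, N, D) token features from a frozen teacher/tokenizer.
    Returns a (B, N, D) tensor of smoothed targets.
    """
    b, n, d = teacher_tokens.shape
    x = F.normalize(teacher_tokens, dim=-1)         # cosine-similarity space
    sim = torch.einsum("bnd,bmd->bnm", x, x)        # (B, N, N) token affinities
    # Each token keeps only its top-k most related tokens (itself included,
    # since self-similarity is maximal after normalization).
    top_sim, top_idx = sim.topk(top_k, dim=-1)      # both (B, N, k)
    weights = F.softmax(top_sim / temperature, -1)  # soft aggregation weights
    # Gather the selected tokens and average them with the soft weights.
    idx = top_idx.unsqueeze(-1).expand(b, n, top_k, d)
    neighbors = teacher_tokens.unsqueeze(1).expand(b, n, n, d).gather(2, idx)
    return (weights.unsqueeze(-1) * neighbors).sum(dim=2)


# Example: morph targets for the 196 patch tokens of a ViT-B/16 teacher.
targets = morph_targets(torch.randn(2, 196, 768))
print(targets.shape)  # torch.Size([2, 196, 768])
```

In a MIM pipeline, the student's predictions at masked positions would then be regressed onto these morphed targets rather than the raw per-token teacher features, which is the mechanism the abstract credits for spatially consistent targets across neighboring tokens.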