Masked Vision and Language Modeling for Multi-modal Representation Learning
Main authors: | , , , , , |
---|---|
Format: | Article |
Language: | eng |
Subjects: | |
Online access: | Order full text |
Summary: | In this paper, we study how to use masked signal modeling in vision and
language (V+L) representation learning. Instead of developing masked language
modeling (MLM) and masked image modeling (MIM) independently, we propose to
build joint masked vision and language modeling, where the masked signal of one
modality is reconstructed with the help of the other modality. This is
motivated by the nature of image-text paired data: the image and
the text convey almost the same information but in different formats. The
masked signal reconstruction of one modality conditioned on the other modality
can also implicitly learn cross-modal alignment between language tokens and
image patches. Our experiments on various V+L tasks show that the proposed
method, along with common V+L alignment losses, achieves state-of-the-art
performance in the regime of millions of pre-training samples. The method also
outperforms other competitors by a significant margin in limited-data
scenarios. |
---|---|
DOI: | 10.48550/arxiv.2208.02131 |
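
The joint masking idea in the summary, where one modality is masked and the other is left intact as context, can be illustrated with a small data-preparation sketch. This is not the paper's implementation: the `MASK` placeholder, the function names, the string stand-ins for image patches, and the masking ratios are all assumptions chosen for illustration, and no model or reconstruction loss is included.

```python
import random

MASK = "[MASK]"  # hypothetical placeholder for a masked patch or token


def mask_sequence(items, ratio, rng):
    """Replace a random subset of items with MASK.

    Returns the masked copy and a dict mapping each masked position
    to the original item that a model would have to reconstruct.
    """
    n_mask = max(1, round(len(items) * ratio))
    positions = sorted(rng.sample(range(len(items)), n_mask))
    masked = list(items)
    targets = {}
    for pos in positions:
        targets[pos] = items[pos]
        masked[pos] = MASK
    return masked, targets


def joint_masking_examples(patches, tokens, image_ratio=0.6, text_ratio=0.3, seed=0):
    """Build two training examples from one image-text pair.

    In each example exactly one modality is masked; the other stays
    unmasked so it can serve as the conditioning signal.
    The ratios are illustrative assumptions, not values from the paper.
    """
    rng = random.Random(seed)
    masked_patches, patch_targets = mask_sequence(patches, image_ratio, rng)
    masked_tokens, token_targets = mask_sequence(tokens, text_ratio, rng)
    return [
        # Masked image, reconstructed with the full text as context.
        {"reconstruct": "image", "image": masked_patches,
         "text": list(tokens), "targets": patch_targets},
        # Masked text, reconstructed with the full image as context.
        {"reconstruct": "text", "image": list(patches),
         "text": masked_tokens, "targets": token_targets},
    ]


patches = [f"patch_{i}" for i in range(5)]  # stand-ins for image patches
tokens = ["a", "dog", "on", "grass"]        # stand-ins for text tokens
examples = joint_masking_examples(patches, tokens)
```

Each returned example pairs a masked input with the untouched other modality, so a model trained on it could only recover the `targets` by attending across modalities, which is the intuition the summary gives for implicit cross-modal alignment.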