Global and Local Semantic Completion Learning for Vision-Language Pre-training
Format: Article
Language: English
Online access: Order full text
Abstract: Cross-modal alignment plays a crucial role in vision-language pre-training (VLP) models, enabling them to capture meaningful associations across different modalities. For this purpose, numerous masked modeling tasks have been proposed for VLP to further promote cross-modal interactions. The core idea of previous masked modeling tasks is to reconstruct the masked tokens from the visible context, thereby learning local-local alignment. However, most of them pay little attention to the global semantic features produced for the masked data, which limits how well global representations of one modality align with local features of the other. Therefore, in this paper, we propose a novel Global and Local Semantic Completion Learning (GLSCL) task that facilitates global-local alignment and local-local alignment simultaneously. Specifically, the GLSCL task recovers the missing semantics of masked data, restoring both global and local features through cross-modal interactions. Our GLSCL consists of masked global semantic completion (MGSC) and masked local token completion (MLTC). MGSC promotes learning more representative global features, which strongly affect downstream performance, while MLTC reconstructs modal-fusion local tokens, further enhancing accurate comprehension of multimodal data. To evaluate the proposed approaches on cross-modal alignment, we develop a validation benchmark called ALIGN-BENCH. Moreover, we present a flexible vision encoder, enabling our model to perform both image-text and video-text multimodal tasks. Experimental results show that our proposed method obtains state-of-the-art performance on various vision-language benchmarks, such as visual question answering, image-text retrieval, and video-text retrieval.
DOI: 10.48550/arxiv.2306.07096
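The abstract describes two completion objectives, MGSC (masked global semantic completion) and MLTC (masked local token completion), both computed from features recovered through cross-modal interactions on masked input. The sketch below illustrates how such objectives could be set up in PyTorch; the encoder architecture, the mean-pooled global feature, the masking scheme, and all module names and dimensions are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of MGSC- and MLTC-style completion losses.
# Everything here (toy encoder, mean-pooled global feature, MSE targets)
# is an assumption for illustration, not the authors' actual model.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ToyCrossModalEncoder(nn.Module):
    """Stand-in fusion encoder: concatenates image and text tokens and
    runs a shared Transformer over them (hypothetical architecture)."""
    def __init__(self, dim=256, heads=4, layers=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)

    def forward(self, img_tokens, txt_tokens):
        fused = self.encoder(torch.cat([img_tokens, txt_tokens], dim=1))
        n_img = img_tokens.size(1)
        return fused[:, :n_img], fused[:, n_img:]  # split back per modality


def completion_losses(encoder, img_tokens, txt_tokens, mask_ratio=0.3):
    """Sketch of the two objectives:
      - MGSC: the global feature recovered from the masked input should match
        the global feature obtained from the intact input.
      - MLTC: masked local fusion tokens should be reconstructed from
        cross-modal context.
    """
    B, N, _ = txt_tokens.shape

    # Teacher pass on the intact pair (no gradient) to get target features.
    with torch.no_grad():
        _, txt_full = encoder(img_tokens, txt_tokens)
        global_target = txt_full.mean(dim=1)   # stand-in for a [CLS]-style global feature
        local_target = txt_full                # per-token fusion targets

    # Mask a fixed fraction of text tokens (zero embedding as a stand-in mask token).
    num_mask = max(1, int(mask_ratio * N))
    rand_idx = torch.rand(B, N, device=txt_tokens.device).argsort(dim=1)[:, :num_mask]
    mask = torch.zeros(B, N, device=txt_tokens.device).scatter_(1, rand_idx, 1.0).bool()
    txt_masked = txt_tokens.masked_fill(mask.unsqueeze(-1), 0.0)

    _, txt_out = encoder(img_tokens, txt_masked)

    # MGSC: recover global semantics despite the missing tokens.
    loss_mgsc = F.mse_loss(txt_out.mean(dim=1), global_target)
    # MLTC: reconstruct only the masked local fusion tokens.
    loss_mltc = F.mse_loss(txt_out[mask], local_target[mask])
    return loss_mgsc, loss_mltc


if __name__ == "__main__":
    enc = ToyCrossModalEncoder()
    imgs = torch.randn(2, 16, 256)   # 16 patch tokens per image (toy values)
    txts = torch.randn(2, 12, 256)   # 12 word tokens per caption (toy values)
    l_g, l_l = completion_losses(enc, imgs, txts)
    print(f"MGSC loss: {l_g.item():.4f}  MLTC loss: {l_l.item():.4f}")
```

In this toy setup the two losses would typically be summed with the usual VLP objectives (contrastive, matching, masked modeling) during pre-training; the weighting and the exact form of the global feature are design choices not specified by the abstract.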