UniVL: A Unified Video and Language Pre-Training Model for Multimodal Understanding and Generation
Saved in:
Main authors: , , , , , , , ,
Format: Article
Language: English
Subjects:
Online access: Order full text
Summary: With the recent success of pre-training techniques for NLP and
image-linguistic tasks, several video-linguistic pre-training works have
gradually been developed to improve video-text downstream tasks. However, most
existing multimodal models are pre-trained for understanding tasks, leading to
a pretrain-finetune discrepancy for generation tasks. This paper proposes
UniVL: a Unified Video and Language pre-training model for both multimodal
understanding and generation. It comprises four components: two single-modal
encoders, a cross encoder, and a decoder, all built on the Transformer
backbone. Five objectives are designed to train these components: video-text
joint, conditioned masked language model (CMLM), conditioned masked frame
model (CMFM), video-text alignment, and language reconstruction. Two
pre-training strategies, stage-by-stage pre-training (StagedP) and enhanced
video representation (EnhancedV), are further developed to make the training
of UniVL more effective. Pre-training is carried out on HowTo100M, a sizeable
instructional-video dataset. Experimental results demonstrate that UniVL
learns strong video-text representations and achieves state-of-the-art results
on five downstream tasks.
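As a rough illustration of the four-component layout the summary describes (two single-modal Transformer encoders, a cross encoder, and a decoder), the following PyTorch sketch wires the pieces together. It is a minimal sketch under stated assumptions: all layer counts, dimensions, and names (`UniVLSketch`, `video_dim`, the projection of pre-extracted clip features) are illustrative and do not reflect the authors' implementation or the five training objectives.

```python
# Minimal sketch of a UniVL-style four-component model:
# text encoder + video encoder -> cross encoder -> decoder.
# All hyperparameters below are illustrative assumptions.
import torch
import torch.nn as nn

class UniVLSketch(nn.Module):
    def __init__(self, vocab_size=30522, d_model=768, video_dim=1024,
                 n_heads=12, n_layers=6):
        super().__init__()
        self.text_embed = nn.Embedding(vocab_size, d_model)
        # Project pre-extracted video clip features into the model dimension.
        self.video_proj = nn.Linear(video_dim, d_model)
        enc = lambda: nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.text_encoder = nn.TransformerEncoder(enc(), n_layers)
        self.video_encoder = nn.TransformerEncoder(enc(), n_layers)
        # Cross encoder fuses the two unimodal sequences (here by concatenation).
        self.cross_encoder = nn.TransformerEncoder(enc(), n_layers)
        dec = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec, n_layers)
        # Output head, e.g. for a language-reconstruction-style objective.
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, token_ids, video_feats, tgt_ids):
        t = self.text_encoder(self.text_embed(token_ids))      # (B, Lt, d)
        v = self.video_encoder(self.video_proj(video_feats))   # (B, Lv, d)
        fused = self.cross_encoder(torch.cat([t, v], dim=1))   # (B, Lt+Lv, d)
        out = self.decoder(self.text_embed(tgt_ids), fused)    # (B, Lg, d)
        return self.lm_head(out)

model = UniVLSketch()
logits = model(torch.randint(0, 30522, (2, 20)),   # caption token ids
               torch.randn(2, 16, 1024),           # 16 clip feature vectors
               torch.randint(0, 30522, (2, 12)))   # decoder input tokens
print(logits.shape)  # torch.Size([2, 12, 30522])
```

Separating unimodal encoders from the cross encoder is what lets such a model serve both understanding tasks (using encoder outputs) and generation tasks (using the decoder), which is the pretrain-finetune gap the paper targets.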
DOI: 10.48550/arxiv.2002.06353