Wukong: A 100 Million Large-scale Chinese Cross-modal Pre-training Benchmark
Main Authors: | , , , , , , , , , , , |
Format: | Article |
Language: | eng |
Subjects: | |
Online Access: | Order full text |
Abstract: | Vision-Language Pre-training (VLP) models have shown remarkable performance
on various downstream tasks. Their success heavily relies on the scale of
pre-trained cross-modal datasets. However, the lack of large-scale datasets and
benchmarks in Chinese hinders the development of Chinese VLP models and broader
multilingual applications. In this work, we release a large-scale Chinese
cross-modal dataset named Wukong, which contains 100 million Chinese image-text
pairs collected from the web. Wukong aims to benchmark different multi-modal
pre-training methods to facilitate the VLP research and community development.
Furthermore, we release a group of models pre-trained with various image
encoders (ViT-B/ViT-L/SwinT) and also apply advanced pre-training techniques
to VLP, such as locked-image text tuning, token-wise similarity in contrastive
learning, and reduced-token interaction. Extensive experiments and a
benchmarking of different downstream tasks, including a new, currently largest
human-verified image-text test dataset, are also provided. Experiments show that
Wukong can serve as a promising Chinese pre-training dataset and benchmark for
different cross-modal learning methods. For the zero-shot image classification
task on 10 datasets, $Wukong_{ViT-L}$ achieves an average accuracy of 73.03%.
For the image-text retrieval task, it achieves a mean recall of 71.6% on
AIC-ICC, which is 12.9% higher than that of WenLan 2.0. Our Wukong models are
also benchmarked against other variants on multiple downstream datasets, e.g.,
Flickr8K-CN, Flickr-30K-CN, and COCO-CN. More information is available at:
https://wukong-dataset.github.io/wukong-dataset/. |
DOI: | 10.48550/arxiv.2202.06767 |
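
One of the techniques named in the abstract, token-wise similarity in contrastive learning, replaces the usual single global image-text dot product with a late interaction over patch and word embeddings. The following is a minimal PyTorch sketch of that idea, assuming L2-normalizable token embeddings produced by the two encoders; the function names, tensor shapes, and temperature value are hypothetical and are not taken from the paper's released code.

```python
# Minimal sketch of token-wise similarity (late interaction) for
# image-text contrastive learning, in the spirit of the technique named
# in the abstract. All names, shapes, and the temperature are
# illustrative assumptions, not the released Wukong implementation.
import torch
import torch.nn.functional as F

def token_wise_similarity(img_tokens: torch.Tensor,
                          txt_tokens: torch.Tensor) -> torch.Tensor:
    """Similarity between one image and one text from their token sets.

    img_tokens: (n_patches, d) patch embeddings.
    txt_tokens: (n_words, d) text token embeddings.
    """
    img = F.normalize(img_tokens, dim=-1)
    txt = F.normalize(txt_tokens, dim=-1)
    sim = img @ txt.t()                      # (n_patches, n_words) cosines
    # Each patch keeps its best-matching word and vice versa;
    # the two directions are averaged into one scalar score.
    i2t = sim.max(dim=1).values.mean()
    t2i = sim.max(dim=0).values.mean()
    return 0.5 * (i2t + t2i)

def contrastive_loss(img_batch, txt_batch, temperature: float = 0.07):
    """Symmetric InfoNCE over a batch of (image, text) token sets."""
    logits = torch.stack([
        torch.stack([token_wise_similarity(i, t) for t in txt_batch])
        for i in img_batch
    ]) / temperature                         # (B, B) score matrix
    labels = torch.arange(logits.size(0))    # matched pairs on the diagonal
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.t(), labels))
```

The double loop over the batch is written for clarity; a practical implementation would batch the pairwise max-pooling, but the scoring rule is the same.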