WenLan: Bridging Vision and Language by Large-Scale Multi-Modal Pre-Training

Multi-modal pre-training models have been intensively explored to bridge vision and language in recent years. However, most of them explicitly model the cross-modal interaction between image-text pairs, by assuming that there exists strong semantic correlation between the text and image modalities....

Bibliographic Details
Published in: arXiv.org 2021-07
Main authors: Huo, Yuqi; Zhang, Manli; Liu, Guangzhen; Lu, Haoyu; Gao, Yizhao; Yang, Guoxing; Wen, Jingyuan; Zhang, Heng; Xu, Baogui; Zheng, Weihao; Xi, Zongzheng; Yang, Yueqian; Hu, Anwen; Zhao, Jinming; Li, Ruichen; Zhao, Yida; Zhang, Liang; Song, Yuqing; Hong, Xin; Cui, Wanqing; Hou, Danyang; Li, Yingyan; Li, Junyi; Liu, Peiyu; Gong, Zheng; Jin, Chuhao; Sun, Yuchong; Chen, Shizhe; Lu, Zhiwu; Dou, Zhicheng; Qin, Jin; Lan, Yanyan; Zhao, Wayne Xin; Song, Ruihua; Wen, Ji-Rong
Format: Article
Language: English