WenLan: Bridging Vision and Language by Large-Scale Multi-Modal Pre-Training

Multi-modal pre-training models have been intensively explored to bridge vision and language in recent years. However, most of them explicitly model the cross-modal interaction between image-text pairs, by assuming that there exists strong semantic correlation between the text and image modalities. Since this strong assumption is often invalid in real-world scenarios, we choose to implicitly model the cross-modal correlation for large-scale multi-modal pre-training, which is the focus of the Chinese project `WenLan' led by our team. Specifically, with the weak correlation assumption over image-text pairs, we propose a two-tower pre-training model called BriVL within the cross-modal contrastive learning framework. Unlike OpenAI CLIP, which adopts a simple contrastive learning method, we devise a more advanced algorithm by adapting the latest method MoCo into the cross-modal scenario. By building a large queue-based dictionary, our BriVL can incorporate more negative samples with limited GPU resources. We further construct a large Chinese multi-source image-text dataset called RUC-CAS-WenLan for pre-training our BriVL model. Extensive experiments demonstrate that the pre-trained BriVL model outperforms both UNITER and OpenAI CLIP on various downstream tasks.
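As a rough illustration of the queue-based contrastive objective described in the abstract, below is a minimal PyTorch sketch of an image-to-text InfoNCE loss that draws extra negatives from a MoCo-style queue. This is not the authors' BriVL implementation: the encoder interfaces, embedding dimension, queue size, and temperature are placeholder assumptions, and BriVL's momentum-updated key encoders and symmetric text-to-image loss are omitted for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class QueueContrastive(nn.Module):
    """Image->text InfoNCE with a queue of negative text embeddings.
    Hypothetical sketch only; BriVL additionally momentum-updates key
    encoders (as in MoCo) and uses the symmetric text->image loss."""
    def __init__(self, img_encoder, txt_encoder, dim=128,
                 queue_size=8192, temperature=0.07):
        super().__init__()
        self.img_enc, self.txt_enc = img_encoder, txt_encoder
        self.t = temperature
        # queue of past (normalized) text embeddings used as extra negatives
        self.register_buffer("queue", F.normalize(torch.randn(queue_size, dim), dim=1))
        self.register_buffer("ptr", torch.zeros(1, dtype=torch.long))

    @torch.no_grad()
    def _enqueue(self, keys):
        n, ptr = keys.shape[0], int(self.ptr)
        self.queue[ptr:ptr + n] = keys          # assumes queue_size is a multiple of n
        self.ptr[0] = (ptr + n) % self.queue.shape[0]

    def forward(self, images, texts):
        q = F.normalize(self.img_enc(images), dim=1)     # (B, dim) image queries
        with torch.no_grad():
            k = F.normalize(self.txt_enc(texts), dim=1)  # (B, dim) text keys
        l_pos = (q * k).sum(dim=1, keepdim=True)         # similarity to matched text
        l_neg = q @ self.queue.clone().T                 # similarity to queued negatives
        logits = torch.cat([l_pos, l_neg], dim=1) / self.t
        labels = torch.zeros(q.size(0), dtype=torch.long, device=q.device)
        loss = F.cross_entropy(logits, labels)           # InfoNCE: matched pair is class 0
        self._enqueue(k)                                 # refresh queue with current keys
        return loss
```

The point of the queue is that the number of negatives can far exceed the mini-batch size without keeping extra images or texts in GPU memory, which is the motivation the abstract gives for adapting MoCo to the cross-modal setting.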


Saved in:
Bibliographic Details
Published in: arXiv.org 2021-07
Main Authors: Huo, Yuqi; Zhang, Manli; Liu, Guangzhen; Lu, Haoyu; Gao, Yizhao; Yang, Guoxing; Wen, Jingyuan; Zhang, Heng; Xu, Baogui; Zheng, Weihao; Xi, Zongzheng; Yang, Yueqian; Hu, Anwen; Zhao, Jinming; Li, Ruichen; Zhao, Yida; Zhang, Liang; Song, Yuqing; Hong, Xin; Cui, Wanqing; Hou, Danyang; Li, Yingyan; Li, Junyi; Liu, Peiyu; Gong, Zheng; Jin, Chuhao; Sun, Yuchong; Chen, Shizhe; Lu, Zhiwu; Dou, Zhicheng; Qin, Jin; Lan, Yanyan; Zhao, Wayne Xin; Song, Ruihua; Wen, Ji-Rong
Format: Article
Language: English
Subjects: Algorithms; Correlation; Machine learning; Training; Vision
EISSN: 2331-8422
Online Access: Full Text