WenLan: Bridging Vision and Language by Large-Scale Multi-Modal Pre-Training

Multi-modal pre-training models have been intensively explored to bridge vision and language in recent years. However, most of them explicitly model the cross-modal interaction between image-text pairs, by assuming that there exists strong semantic correlation between the text and image modalities. Since this strong assumption is often invalid in real-world scenarios, we choose to implicitly model the cross-modal correlation for large-scale multi-modal pre-training, which is the focus of the Chinese project `WenLan' led by our team. Specifically, with the weak correlation assumption over image-text pairs, we propose a two-tower pre-training model called BriVL within the cross-modal contrastive learning framework. Unlike OpenAI CLIP, which adopts a simple contrastive learning method, we devise a more advanced algorithm by adapting the latest method MoCo into the cross-modal scenario. By building a large queue-based dictionary, our BriVL can incorporate more negative samples with limited GPU resources. We further construct a large Chinese multi-source image-text dataset called RUC-CAS-WenLan for pre-training our BriVL model. Extensive experiments demonstrate that the pre-trained BriVL model outperforms both UNITER and OpenAI CLIP on various downstream tasks.
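As a rough illustration of the queue-based contrastive objective described in the abstract, below is a minimal PyTorch sketch of an image-to-text InfoNCE loss that draws extra negatives from a MoCo-style queue. This is not the authors' BriVL implementation: the encoder interfaces, embedding dimension, queue size, and temperature are placeholder assumptions, and BriVL's momentum-updated key encoders and symmetric text-to-image loss are omitted for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class QueueContrastive(nn.Module):
    """Image->text InfoNCE with a queue of negative text embeddings.
    Hypothetical sketch only; BriVL additionally momentum-updates key
    encoders (as in MoCo) and uses the symmetric text->image loss."""
    def __init__(self, img_encoder, txt_encoder, dim=128,
                 queue_size=8192, temperature=0.07):
        super().__init__()
        self.img_enc, self.txt_enc = img_encoder, txt_encoder
        self.t = temperature
        # queue of past (normalized) text embeddings used as extra negatives
        self.register_buffer("queue", F.normalize(torch.randn(queue_size, dim), dim=1))
        self.register_buffer("ptr", torch.zeros(1, dtype=torch.long))

    @torch.no_grad()
    def _enqueue(self, keys):
        n, ptr = keys.shape[0], int(self.ptr)
        self.queue[ptr:ptr + n] = keys          # assumes queue_size is a multiple of n
        self.ptr[0] = (ptr + n) % self.queue.shape[0]

    def forward(self, images, texts):
        q = F.normalize(self.img_enc(images), dim=1)     # (B, dim) image queries
        with torch.no_grad():
            k = F.normalize(self.txt_enc(texts), dim=1)  # (B, dim) text keys
        l_pos = (q * k).sum(dim=1, keepdim=True)         # similarity to matched text
        l_neg = q @ self.queue.clone().T                 # similarity to queued negatives
        logits = torch.cat([l_pos, l_neg], dim=1) / self.t
        labels = torch.zeros(q.size(0), dtype=torch.long, device=q.device)
        loss = F.cross_entropy(logits, labels)           # InfoNCE: matched pair is class 0
        self._enqueue(k)                                 # refresh queue with current keys
        return loss
```

The point of the queue is that the number of negatives can far exceed the mini-batch size without keeping extra images or texts in GPU memory, which is the motivation the abstract gives for adapting MoCo to the cross-modal setting.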


Saved in:
Bibliographic Details
Published in: arXiv.org 2021-07
Main Authors: Huo, Yuqi; Zhang, Manli; Liu, Guangzhen; Lu, Haoyu; Gao, Yizhao; Yang, Guoxing; Wen, Jingyuan; Zhang, Heng; Xu, Baogui; Zheng, Weihao; Xi, Zongzheng; Yang, Yueqian; Hu, Anwen; Zhao, Jinming; Li, Ruichen; Zhao, Yida; Zhang, Liang; Song, Yuqing; Hong, Xin; Cui, Wanqing; Hou, Danyang; Li, Yingyan; Li, Junyi; Liu, Peiyu; Gong, Zheng; Jin, Chuhao; Sun, Yuchong; Chen, Shizhe; Lu, Zhiwu; Dou, Zhicheng; Qin, Jin; Lan, Yanyan; Zhao, Wayne Xin; Song, Ruihua; Wen, Ji-Rong
Format: Article
Language: English
Subjects: Algorithms; Correlation; Machine learning; Training; Vision
EISSN: 2331-8422
Online Access: Full Text