WenLan: Bridging Vision and Language by Large-Scale Multi-Modal Pre-Training
Multi-modal pre-training models have been intensively explored to bridge vision and language in recent years. However, most of them explicitly model the cross-modal interaction between image-text pairs, by assuming that there exists strong semantic correlation between the text and image modalities....
Saved in:
Published in: | arXiv.org 2021-07 |
---|---|
Main authors: | Huo, Yuqi; Zhang, Manli; Liu, Guangzhen; Lu, Haoyu; Gao, Yizhao; Yang, Guoxing; Wen, Jingyuan; Zhang, Heng; Xu, Baogui; Zheng, Weihao; Zongzheng Xi; Yang, Yueqian; Hu, Anwen; Zhao, Jinming; Li, Ruichen; Zhao, Yida; Zhang, Liang; Song, Yuqing; Hong, Xin; Cui, Wanqing; Hou, Danyang; Li, Yingyan; Li, Junyi; Liu, Peiyu; Gong, Zheng; Jin, Chuhao; Sun, Yuchong; Chen, Shizhe; Lu, Zhiwu; Dou, Zhicheng; Qin, Jin; Lan, Yanyan; Wayne Xin Zhao; Song, Ruihua; Ji-Rong, Wen |
Format: | Article |
Language: | eng |
Subjects: | Algorithms; Correlation; Machine learning; Training; Vision |
Online access: | Full text |
container_title | arXiv.org |
---|---|
creator | Huo, Yuqi; Zhang, Manli; Liu, Guangzhen; Lu, Haoyu; Gao, Yizhao; Yang, Guoxing; Wen, Jingyuan; Zhang, Heng; Xu, Baogui; Zheng, Weihao; Zongzheng Xi; Yang, Yueqian; Hu, Anwen; Zhao, Jinming; Li, Ruichen; Zhao, Yida; Zhang, Liang; Song, Yuqing; Hong, Xin; Cui, Wanqing; Hou, Danyang; Li, Yingyan; Li, Junyi; Liu, Peiyu; Gong, Zheng; Jin, Chuhao; Sun, Yuchong; Chen, Shizhe; Lu, Zhiwu; Dou, Zhicheng; Qin, Jin; Lan, Yanyan; Wayne Xin Zhao; Song, Ruihua; Ji-Rong, Wen |
description | Multi-modal pre-training models have been intensively explored to bridge vision and language in recent years. However, most of them explicitly model the cross-modal interaction between image-text pairs, by assuming that there exists strong semantic correlation between the text and image modalities. Since this strong assumption is often invalid in real-world scenarios, we choose to implicitly model the cross-modal correlation for large-scale multi-modal pre-training, which is the focus of the Chinese project `WenLan' led by our team. Specifically, with the weak correlation assumption over image-text pairs, we propose a two-tower pre-training model called BriVL within the cross-modal contrastive learning framework. Unlike OpenAI CLIP that adopts a simple contrastive learning method, we devise a more advanced algorithm by adapting the latest method MoCo into the cross-modal scenario. By building a large queue-based dictionary, our BriVL can incorporate more negative samples in limited GPU resources. We further construct a large Chinese multi-source image-text dataset called RUC-CAS-WenLan for pre-training our BriVL model. Extensive experiments demonstrate that the pre-trained BriVL model outperforms both UNITER and OpenAI CLIP on various downstream tasks. |
format | Article |
fulltext | fulltext |
identifier | EISSN: 2331-8422 |
ispartof | arXiv.org, 2021-07 |
issn | 2331-8422 |
language | eng |
recordid | cdi_proquest_journals_2500710731 |
source | Free E-Journals |
subjects | Algorithms; Correlation; Machine learning; Training; Vision |
title | WenLan: Bridging Vision and Language by Large-Scale Multi-Modal Pre-Training |
url | https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-02-09T06%3A17%3A07IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest&rft_val_fmt=info:ofi/fmt:kev:mtx:book&rft.genre=document&rft.atitle=WenLan:%20Bridging%20Vision%20and%20Language%20by%20Large-Scale%20Multi-Modal%20Pre-Training&rft.jtitle=arXiv.org&rft.au=Huo,%20Yuqi&rft.date=2021-07-08&rft.eissn=2331-8422&rft_id=info:doi/&rft_dat=%3Cproquest%3E2500710731%3C/proquest%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2500710731&rft_id=info:pmid/&rfr_iscdi=true |
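
The abstract in the description field above outlines BriVL's core training idea: a two-tower image/text model trained with a cross-modal contrastive objective in the style of MoCo, where momentum-updated key encoders feed a large queue-based dictionary so that many negative samples can be used under limited GPU memory. The sketch below illustrates that idea only; the class name, encoder placeholders, and hyper-parameters (embedding dimension, queue size, momentum, temperature) are assumptions made for illustration and are not taken from the WenLan/BriVL code.

```python
# Hedged sketch of a MoCo-style cross-modal contrastive objective, loosely
# following the abstract above. All names and hyper-parameters are
# illustrative placeholders, not the actual BriVL implementation.
import copy

import torch
import torch.nn as nn
import torch.nn.functional as F


class CrossModalMoCo(nn.Module):
    """Two-tower contrastive learner with momentum encoders and FIFO queues."""

    def __init__(self, image_tower: nn.Module, text_tower: nn.Module,
                 dim: int = 256, queue_size: int = 8192,
                 momentum: float = 0.999, temperature: float = 0.07):
        super().__init__()
        self.momentum = momentum
        self.temperature = temperature

        # Query encoders (trained by back-propagation).
        self.image_q, self.text_q = image_tower, text_tower
        # Key encoders are slowly updated (momentum) copies of the towers.
        self.image_k = copy.deepcopy(image_tower)
        self.text_k = copy.deepcopy(text_tower)
        for p in list(self.image_k.parameters()) + list(self.text_k.parameters()):
            p.requires_grad = False

        # Queue-based dictionaries of negative keys, one per modality.
        self.register_buffer("image_queue", F.normalize(torch.randn(queue_size, dim), dim=1))
        self.register_buffer("text_queue", F.normalize(torch.randn(queue_size, dim), dim=1))

    @torch.no_grad()
    def _momentum_update(self):
        for q, k in ((self.image_q, self.image_k), (self.text_q, self.text_k)):
            for pq, pk in zip(q.parameters(), k.parameters()):
                pk.data.mul_(self.momentum).add_(pq.data, alpha=1.0 - self.momentum)

    @staticmethod
    @torch.no_grad()
    def _enqueue(queue: torch.Tensor, keys: torch.Tensor) -> torch.Tensor:
        # Drop the oldest entries and append the freshly computed keys (FIFO).
        return torch.cat([queue[keys.size(0):], keys], dim=0)

    def _info_nce(self, query, positive_key, negative_queue):
        # One positive per query; all queued keys of the other modality are negatives.
        l_pos = (query * positive_key).sum(dim=1, keepdim=True)   # (B, 1)
        l_neg = query @ negative_queue.t()                        # (B, K)
        logits = torch.cat([l_pos, l_neg], dim=1) / self.temperature
        labels = torch.zeros(query.size(0), dtype=torch.long, device=query.device)
        return F.cross_entropy(logits, labels)

    def forward(self, images, texts):
        img_q = F.normalize(self.image_q(images), dim=1)
        txt_q = F.normalize(self.text_q(texts), dim=1)
        with torch.no_grad():
            self._momentum_update()
            img_k = F.normalize(self.image_k(images), dim=1)
            txt_k = F.normalize(self.text_k(texts), dim=1)

        # Symmetric image-to-text and text-to-image contrastive losses.
        loss = (self._info_nce(img_q, txt_k, self.text_queue)
                + self._info_nce(txt_q, img_k, self.image_queue))

        # Update the dictionaries with the new keys.
        self.image_queue = self._enqueue(self.image_queue, img_k)
        self.text_queue = self._enqueue(self.text_queue, txt_k)
        return loss
```

At each step, the paired embedding produced by the other modality's momentum encoder serves as the positive, while the queued keys serve as negatives, so even a small batch is contrasted against thousands of negatives; this queue-based design is what the abstract credits for fitting more negative samples into limited GPU resources.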