CLIP with Quality Captions: A Strong Pretraining for Vision Tasks

CLIP models perform remarkably well on zero-shot classification and retrieval tasks. However, recent studies have shown that the representations learned by CLIP are not well suited for dense prediction tasks such as object detection, semantic segmentation, or depth estimation. More recently, multi-stage training methods for CLIP models were introduced to mitigate CLIP's weak performance on these downstream tasks. In this work, we find that simply improving the quality of captions in image-text datasets improves the quality of CLIP's visual representations, resulting in significant improvements on downstream dense prediction vision tasks. In fact, we find that CLIP pretraining with good quality captions can surpass recent supervised, self-supervised, and weakly supervised pretraining methods. We show that when a CLIP model with ViT-B/16 as the image encoder is trained on well-aligned image-text pairs, it obtains 12.1% higher mIoU and 11.5% lower RMSE on semantic segmentation and depth estimation tasks than recent state-of-the-art Masked Image Modeling (MIM) pretraining methods such as Masked Autoencoder (MAE). We find that mobile architectures also benefit significantly from CLIP pretraining. A recent mobile vision architecture, MCi2, with CLIP pretraining obtains performance similar to Swin-L pretrained on ImageNet-22k on the semantic segmentation task while being 6.1× smaller. Moreover, we show that improving caption quality results in 10× data efficiency when finetuning for dense prediction tasks.
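
The abstract's point is that these gains come from the data rather than the objective: the standard CLIP contrastive loss is kept unchanged, and only the captions are made better aligned with the images. For reference, the sketch below shows a minimal symmetric contrastive (InfoNCE) objective of the kind used in CLIP-style pretraining; it is a generic illustration, and the function name, temperature value, and tensor shapes are assumptions rather than the authors' implementation.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_features: torch.Tensor,
                          text_features: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss over a batch of paired image/caption embeddings.

    image_features, text_features: (batch, dim) outputs of the image and
    text encoders for aligned image-caption pairs in the same batch order.
    """
    # L2-normalize so the dot products below are cosine similarities.
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    # (batch, batch) similarity matrix; diagonal entries are the true pairs.
    logits = image_features @ text_features.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # Contrast each image against all captions, and each caption against all images.
    loss_image_to_text = F.cross_entropy(logits, targets)
    loss_text_to_image = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_image_to_text + loss_text_to_image)
```

Under this objective, the abstract's claim is that training on better-aligned image-caption pairs is what improves the image encoder's representations for dense prediction.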

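For the dense-prediction results (semantic segmentation and depth estimation), the pretrained image encoder is reused as a backbone and finetuned together with a task-specific head. The sketch below shows one simple way a segmentation head could sit on top of ViT-B/16 patch tokens; the class name, the number of output classes, and the assumption that the CLS token has already been dropped are placeholders, since this record does not describe the paper's actual decoder.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleSegmentationHead(nn.Module):
    """Per-patch linear classifier on top of ViT patch tokens, upsampled to pixels."""

    def __init__(self, embed_dim: int = 768, num_classes: int = 150):
        super().__init__()
        # A 1x1 convolution acts as a per-patch linear classifier.
        self.classifier = nn.Conv2d(embed_dim, num_classes, kernel_size=1)

    def forward(self, patch_tokens: torch.Tensor, image_size: int) -> torch.Tensor:
        # patch_tokens: (batch, num_patches, embed_dim) from a ViT-B/16 backbone,
        # with the CLS token already removed.
        b, n, d = patch_tokens.shape
        h = w = int(n ** 0.5)  # e.g. a 14x14 patch grid for 224px inputs with 16px patches
        feature_map = patch_tokens.transpose(1, 2).reshape(b, d, h, w)
        logits = self.classifier(feature_map)  # (batch, num_classes, h, w)
        # Upsample patch-level logits back to the input resolution.
        return F.interpolate(logits, size=(image_size, image_size),
                             mode="bilinear", align_corners=False)
```
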
Bibliographic Details

Published in: arXiv.org, 2024-05
Main authors: Pavan Kumar Anasosalu Vasu; Pouransari, Hadi; Faghri, Fartash; Oncel Tuzel
Format: Article
Language: English
Subjects: Image quality; Image segmentation; Object recognition; Representations; Semantic segmentation; Semantics; Vision
Online access: Full text
EISSN: 2331-8422
Source: Free E-Journals
URL: https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-07T17%3A20%3A14IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest&rft_val_fmt=info:ofi/fmt:kev:mtx:book&rft.genre=document&rft.atitle=CLIP%20with%20Quality%20Captions:%20A%20Strong%20Pretraining%20for%20Vision%20Tasks&rft.jtitle=arXiv.org&rft.au=Pavan%20Kumar%20Anasosalu%20Vasu&rft.date=2024-05-14&rft.eissn=2331-8422&rft_id=info:doi/&rft_dat=%3Cproquest%3E3055631073%3C/proquest%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=3055631073&rft_id=info:pmid/&rfr_iscdi=true