CLIP with Quality Captions: A Strong Pretraining for Vision Tasks

CLIP models perform remarkably well on zero-shot classification and retrieval tasks. However, recent studies have shown that the representations learned by CLIP are not well suited for dense prediction tasks such as object detection, semantic segmentation, or depth estimation. More recently, multi-stage training methods for CLIP models were introduced to mitigate CLIP's weak performance on these downstream tasks. In this work, we find that simply improving the quality of captions in image-text datasets improves the quality of CLIP's visual representations, resulting in significant improvements on downstream dense prediction vision tasks. In fact, we find that CLIP pretraining with good quality captions can surpass recent supervised, self-supervised, and weakly supervised pretraining methods. We show that when a CLIP model with ViT-B/16 as the image encoder is trained on well-aligned image-text pairs, it obtains 12.1% higher mIoU and 11.5% lower RMSE on semantic segmentation and depth estimation tasks than recent state-of-the-art Masked Image Modeling (MIM) pretraining methods such as Masked Autoencoder (MAE). We find that mobile architectures also benefit significantly from CLIP pretraining. A recent mobile vision architecture, MCi2, with CLIP pretraining obtains performance similar to Swin-L pretrained on ImageNet-22k on the semantic segmentation task while being 6.1× smaller. Moreover, we show that improving caption quality results in 10× data efficiency when finetuning for dense prediction tasks.
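
The abstract's point is that these gains come from the data rather than the objective: the standard CLIP contrastive loss is kept unchanged, and only the captions are made better aligned with the images. For reference, the sketch below shows a minimal symmetric contrastive (InfoNCE) objective of the kind used in CLIP-style pretraining; it is a generic illustration, and the function name, temperature value, and tensor shapes are assumptions rather than the authors' implementation.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_features: torch.Tensor,
                          text_features: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss over a batch of paired image/caption embeddings.

    image_features, text_features: (batch, dim) outputs of the image and
    text encoders for aligned image-caption pairs in the same batch order.
    """
    # L2-normalize so the dot products below are cosine similarities.
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    # (batch, batch) similarity matrix; diagonal entries are the true pairs.
    logits = image_features @ text_features.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # Contrast each image against all captions, and each caption against all images.
    loss_image_to_text = F.cross_entropy(logits, targets)
    loss_text_to_image = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_image_to_text + loss_text_to_image)
```

Under this objective, the abstract's claim is that training on better-aligned image-caption pairs is what improves the image encoder's representations for dense prediction.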

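For the dense-prediction results (semantic segmentation and depth estimation), the pretrained image encoder is reused as a backbone and finetuned together with a task-specific head. The sketch below shows one simple way a segmentation head could sit on top of ViT-B/16 patch tokens; the class name, the number of output classes, and the assumption that the CLS token has already been dropped are placeholders, since this record does not describe the paper's actual decoder.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleSegmentationHead(nn.Module):
    """Per-patch linear classifier on top of ViT patch tokens, upsampled to pixels."""

    def __init__(self, embed_dim: int = 768, num_classes: int = 150):
        super().__init__()
        # A 1x1 convolution acts as a per-patch linear classifier.
        self.classifier = nn.Conv2d(embed_dim, num_classes, kernel_size=1)

    def forward(self, patch_tokens: torch.Tensor, image_size: int) -> torch.Tensor:
        # patch_tokens: (batch, num_patches, embed_dim) from a ViT-B/16 backbone,
        # with the CLS token already removed.
        b, n, d = patch_tokens.shape
        h = w = int(n ** 0.5)  # e.g. a 14x14 patch grid for 224px inputs with 16px patches
        feature_map = patch_tokens.transpose(1, 2).reshape(b, d, h, w)
        logits = self.classifier(feature_map)  # (batch, num_classes, h, w)
        # Upsample patch-level logits back to the input resolution.
        return F.interpolate(logits, size=(image_size, image_size),
                             mode="bilinear", align_corners=False)
```
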
Bibliographic Details

Published in: arXiv.org, 2024-05
Main authors: Pavan Kumar Anasosalu Vasu; Pouransari, Hadi; Faghri, Fartash; Oncel Tuzel
Format: Article
Language: English
Subjects: Image quality; Image segmentation; Object recognition; Representations; Semantic segmentation; Semantics; Vision
Online access: Full text
EISSN: 2331-8422
Source: Free E-Journals
URL: https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-07T17%3A20%3A14IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest&rft_val_fmt=info:ofi/fmt:kev:mtx:book&rft.genre=document&rft.atitle=CLIP%20with%20Quality%20Captions:%20A%20Strong%20Pretraining%20for%20Vision%20Tasks&rft.jtitle=arXiv.org&rft.au=Pavan%20Kumar%20Anasosalu%20Vasu&rft.date=2024-05-14&rft.eissn=2331-8422&rft_id=info:doi/&rft_dat=%3Cproquest%3E3055631073%3C/proquest%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=3055631073&rft_id=info:pmid/&rfr_iscdi=true