TIPS: Text-Image Pretraining with Spatial Awareness

While image-text representation learning has become very popular in recent years, existing models tend to lack spatial awareness and have limited direct applicability for dense understanding tasks. For this reason, self-supervised image-only pretraining is still the go-to method for many dense vision applications (e.g. depth estimation, semantic segmentation), despite the lack of explicit supervisory signals.

Detailed Description

Bibliographic Details
Main Authors: Maninis, Kevis-Kokitsi, Chen, Kaifeng, Ghosh, Soham, Karpur, Arjun, Chen, Koert, Xia, Ye, Cao, Bingyi, Salz, Daniel, Han, Guangxing, Dlabal, Jan, Gnanapragasam, Dan, Seyedhosseini, Mojtaba, Zhou, Howard, Araujo, Andre
Format: Article
Language: English
Subjects: Computer Science - Computer Vision and Pattern Recognition
Online Access: Order full text
description While image-text representation learning has become very popular in recent years, existing models tend to lack spatial awareness and have limited direct applicability for dense understanding tasks. For this reason, self-supervised image-only pretraining is still the go-to method for many dense vision applications (e.g. depth estimation, semantic segmentation), despite the lack of explicit supervisory signals. In this paper, we close this gap between image-text and self-supervised learning, by proposing a novel general-purpose image-text model, which can be effectively used off-the-shelf for dense and global vision tasks. Our method, which we refer to as Text-Image Pretraining with Spatial awareness (TIPS), leverages two simple and effective insights. First, on textual supervision: we reveal that replacing noisy web image captions by synthetically generated textual descriptions boosts dense understanding performance significantly, due to a much richer signal for learning spatially aware representations. We propose an adapted training method that combines noisy and synthetic captions, resulting in improvements across both dense and global understanding tasks. Second, on the learning technique: we propose to combine contrastive image-text learning with self-supervised masked image modeling, to encourage spatial coherence, unlocking substantial enhancements for downstream applications. Building on these two ideas, we scale our model using the transformer architecture, trained on a curated set of public images. Our experiments are conducted on 8 tasks involving 16 datasets in total, demonstrating strong off-the-shelf performance on both dense and global understanding, for several image-only and image-text tasks.
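The description above attributes the model's spatial awareness to two ingredients: contrastive image-text learning on both noisy web captions and synthetic captions, combined with a self-supervised masked image modeling term. As a rough illustration of how such a combined objective could be wired together, here is a minimal PyTorch sketch; every function name, loss weight, and tensor shape is an assumption made for illustration and not the authors' released implementation.

```python
# Illustrative sketch only: a CLIP-style contrastive loss applied to two
# caption streams (noisy and synthetic), plus a masked-image-modeling term
# on patch features. Weights and shapes are hypothetical.
import torch
import torch.nn.functional as F

def contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired image/text embeddings."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature           # (B, B) similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

def masked_image_modeling_loss(patch_pred, patch_target, mask):
    """Mean squared error on masked patch features (mask: 1 = masked)."""
    per_patch = ((patch_pred - patch_target) ** 2).mean(dim=-1)  # (B, P)
    return (per_patch * mask).sum() / mask.sum().clamp(min=1)

def combined_objective(img_global, noisy_txt, synth_txt,
                       patch_pred, patch_target, mask,
                       w_noisy=0.5, w_synth=0.5, w_mim=1.0):
    """Hypothetical combination of the two caption streams and the MIM term."""
    return (w_noisy * contrastive_loss(img_global, noisy_txt) +
            w_synth * contrastive_loss(img_global, synth_txt) +
            w_mim * masked_image_modeling_loss(patch_pred, patch_target, mask))

# Toy usage with random tensors, just to show the expected shapes.
B, D, P = 8, 512, 196
loss = combined_objective(
    img_global=torch.randn(B, D),
    noisy_txt=torch.randn(B, D),
    synth_txt=torch.randn(B, D),
    patch_pred=torch.randn(B, P, D),
    patch_target=torch.randn(B, P, D),
    mask=(torch.rand(B, P) < 0.4).float(),
)
print(loss.item())
```

In this hedged reading, both caption streams share the same image-level contrastive loss, while the masked image modeling term acts only on patch features, which is the part intended to encourage spatially coherent representations.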
doi_str_mv 10.48550/arxiv.2410.16512
format Article
identifier DOI: 10.48550/arxiv.2410.16512
language eng
recordid cdi_arxiv_primary_2410_16512
source arXiv.org
subjects Computer Science - Computer Vision and Pattern Recognition
title TIPS: Text-Image Pretraining with Spatial Awareness
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-29T19%3A33%3A15IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-arxiv_GOX&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=TIPS:%20Text-Image%20Pretraining%20with%20Spatial%20Awareness&rft.au=Maninis,%20Kevis-Kokitsi&rft.date=2024-10-21&rft_id=info:doi/10.48550/arxiv.2410.16512&rft_dat=%3Carxiv_GOX%3E2410_16512%3C/arxiv_GOX%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_id=info:pmid/&rfr_iscdi=true