Where are my Neighbors? Exploiting Patches Relations in Self-Supervised Vision Transformer

Vision Transformers (ViTs) enabled the use of the transformer architecture on vision tasks, showing impressive performance when trained on large datasets. However, on relatively small datasets, ViTs are less accurate given their lack of inductive bias. To this end, we propose a simple but effective Self-Supervised Learning (SSL) strategy to train ViTs that, without any external annotation or external data, can significantly improve the results. Specifically, we define a set of SSL tasks based on relations of image patches that the model has to solve before or jointly with the supervised task. Differently from the standard ViT, our RelViT model optimizes all the output tokens of the transformer encoder that are related to the image patches, thus exploiting more training signals at each training step. We investigated our method on several image benchmarks, finding that RelViT improves the SSL state-of-the-art methods by a large margin, especially on small datasets. Code is available at: https://github.com/guglielmocamporese/relvit.
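The patch-relation idea in the abstract lends itself to a short sketch: a self-supervised head that classifies the spatial relation between every pair of ViT patch tokens, so that each patch token contributes a training signal. The snippet below is only a minimal, hypothetical illustration under assumed choices (the PatchRelationHead module, the 3x3-neighborhood relation labels, and the toy tensor sizes are assumptions, not the authors' implementation; see the linked repository for the actual RelViT code):

```python
import torch
import torch.nn as nn

# Hypothetical sketch: a pairwise "spatial relation" SSL head on top of ViT patch tokens.
# Each ordered pair of patch tokens (i, j) is labeled with the grid offset between the
# two patches: 8 immediate-neighbor directions, plus class 8 for "not an immediate
# neighbor" (which also covers the pair of a patch with itself).

def relation_labels(grid: int) -> torch.Tensor:
    """Label every ordered patch pair on a grid x grid layout with one of 9 relations."""
    coords = torch.stack(torch.meshgrid(
        torch.arange(grid), torch.arange(grid), indexing="ij"), -1).view(-1, 2)
    dy = coords[:, None, 0] - coords[None, :, 0]   # row offset, shape (N, N)
    dx = coords[:, None, 1] - coords[None, :, 1]   # col offset, shape (N, N)
    labels = torch.full_like(dy, 8)                # default: not an immediate neighbor
    dirs = [(-1, -1), (-1, 0), (-1, 1), (0, -1), (0, 1), (1, -1), (1, 0), (1, 1)]
    for k, (ddy, ddx) in enumerate(dirs):
        labels[(dy == ddy) & (dx == ddx)] = k
    return labels                                  # (N, N) with values in {0..8}

class PatchRelationHead(nn.Module):
    """Predicts the relation class for every ordered pair of patch tokens."""
    def __init__(self, dim: int, n_classes: int = 9):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(2 * dim, dim), nn.GELU(),
                                 nn.Linear(dim, n_classes))

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (B, N, D) patch tokens from the ViT encoder (class token excluded)
        B, N, D = tokens.shape
        pairs = torch.cat([tokens[:, :, None].expand(B, N, N, D),
                           tokens[:, None, :].expand(B, N, N, D)], dim=-1)
        return self.mlp(pairs)                     # (B, N, N, n_classes)

# Toy usage with random "encoder outputs"; in practice these would come from the ViT.
B, grid, D = 2, 4, 64
tokens = torch.randn(B, grid * grid, D)
head = PatchRelationHead(D)
logits = head(tokens)
target = relation_labels(grid).expand(B, -1, -1)
loss = nn.functional.cross_entropy(logits.reshape(-1, 9), target.reshape(-1))
print(loss.item())
```

In this sketch the cross-entropy loss is averaged over all ordered patch pairs, mirroring the abstract's point of supervising every patch token of the encoder output rather than a single class token.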


Bibliographic Details
Main Authors: Camporese, Guglielmo; Izzo, Elena; Ballan, Lamberto
Format: Article
Language: English
Subjects: Computer Science - Computer Vision and Pattern Recognition; Computer Science - Learning
Online Access: Full text via arXiv (https://arxiv.org/abs/2206.00481)
DOI: 10.48550/arxiv.2206.00481
Published: 2022-06-01
Rights: http://creativecommons.org/licenses/by-nc-nd/4.0
Source: arXiv.org