An Image is Worth More Than 16x16 Patches: Exploring Transformers on Individual Pixels
This work does not introduce a new method. Instead, we present an interesting finding that questions the necessity of the inductive bias -- locality in modern computer vision architectures. Concretely, we find that vanilla Transformers can operate by directly treating each individual pixel as a token and achieve highly performant results.
Saved in:
Main authors: | Nguyen, Duy-Kien; Assran, Mahmoud; Jain, Unnat; Oswald, Martin R; Snoek, Cees G. M; Chen, Xinlei |
---|---|
Format: | Article |
Language: | eng |
Subjects: | |
Online access: | Order full text |
container_end_page | |
---|---|
container_issue | |
container_start_page | |
container_title | |
container_volume | |
creator | Nguyen, Duy-Kien ; Assran, Mahmoud ; Jain, Unnat ; Oswald, Martin R ; Snoek, Cees G. M ; Chen, Xinlei |
description | This work does not introduce a new method. Instead, we present an interesting
finding that questions the necessity of the inductive bias -- locality in
modern computer vision architectures. Concretely, we find that vanilla
Transformers can operate by directly treating each individual pixel as a token
and achieve highly performant results. This is substantially different from the
popular design in Vision Transformer, which maintains the inductive bias from
ConvNets towards local neighborhoods (e.g. by treating each 16x16 patch as a
token). We mainly showcase the effectiveness of pixels-as-tokens across three
well-studied tasks in computer vision: supervised learning for object
classification, self-supervised learning via masked autoencoding, and image
generation with diffusion models. Although directly operating on individual
pixels is less computationally practical, we believe the community must be
aware of this surprising piece of knowledge when devising the next generation
of neural architectures for computer vision. |
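The pixels-as-tokens idea described in the abstract amounts to a change in how an image is flattened into a token sequence: instead of grouping each 16x16 neighborhood into one token as in ViT, every pixel becomes its own token. A minimal sketch of the two tokenizations (illustrative only; names and shapes are our own, not from the paper):

```python
import numpy as np

def pixels_as_tokens(img):
    # (H, W, C) image -> H*W tokens, each of dimension C (one per pixel)
    h, w, c = img.shape
    return img.reshape(h * w, c)

def patches_as_tokens(img, p=16):
    # ViT-style: (H, W, C) -> (H/p)*(W/p) tokens of dimension p*p*C
    h, w, c = img.shape
    assert h % p == 0 and w % p == 0
    x = img.reshape(h // p, p, w // p, p, c)
    x = x.transpose(0, 2, 1, 3, 4)  # gather each p x p patch contiguously
    return x.reshape((h // p) * (w // p), p * p * c)

img = np.random.rand(224, 224, 3)
print(pixels_as_tokens(img).shape)   # (50176, 3)
print(patches_as_tokens(img).shape)  # (196, 768)
```

The sequence-length gap (50,176 vs. 196 tokens for a 224x224 image) is why the abstract notes that operating directly on pixels is less computationally practical: self-attention cost grows quadratically with sequence length.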
doi_str_mv | 10.48550/arxiv.2406.09415 |
format | Article |
fulltext | fulltext_linktorsrc |
identifier | DOI: 10.48550/arxiv.2406.09415 |
ispartof | |
issn | |
language | eng |
recordid | cdi_arxiv_primary_2406_09415 |
source | arXiv.org |
subjects | Computer Science - Computer Vision and Pattern Recognition ; Computer Science - Learning |
title | An Image is Worth More Than 16x16 Patches: Exploring Transformers on Individual Pixels |