Scattering Vision Transformer: Spectral Mixing Matters
creator | Patro, Badri N; Agneeswaran, Vijay Srinivas
description | Vision transformers have gained significant attention and achieved
state-of-the-art performance in various computer vision tasks, including image
classification, instance segmentation, and object detection. However, challenges
remain in addressing attention complexity and effectively capturing fine-grained
information within images. Existing solutions often resort to down-sampling
operations, such as pooling, to reduce computational cost. Unfortunately, such
operations are non-invertible and can result in information loss. In this paper,
we present a novel approach called Scattering Vision Transformer (SVT) to tackle
these challenges. SVT incorporates a spectral scattering network that enables the
capture of intricate image details. SVT overcomes the invertibility issue
associated with down-sampling operations by separating low-frequency and
high-frequency components. Furthermore, SVT introduces a unique spectral gating
network utilizing Einstein multiplication for token and channel mixing,
effectively reducing complexity. We show that SVT achieves state-of-the-art
performance on the ImageNet dataset with a significant reduction in the number of
parameters and FLOPS. SVT shows a 2% improvement over LiTv2 and iFormer. SVT-H-S
reaches 84.2% top-1 accuracy, while SVT-H-B reaches 85.2% (state of the art for
base versions) and SVT-H-L reaches 85.7% (state of the art for large versions).
SVT also shows comparable results in other vision tasks such as instance
segmentation, and it outperforms other transformers in transfer learning on
standard datasets such as CIFAR-10, CIFAR-100, Oxford Flowers, and Stanford Cars.
The project page is available at https://badripatro.github.io/svt/.
(An illustrative sketch of the spectral gating idea appears after this record.) |
doi_str_mv | 10.48550/arxiv.2311.01310 |
format | Article |
identifier | DOI: 10.48550/arxiv.2311.01310 |
language | eng |
recordid | cdi_arxiv_primary_2311_01310 |
source | arXiv.org |
subjects | Computer Science - Artificial Intelligence; Computer Science - Computer Vision and Pattern Recognition; Computer Science - Learning
title | Scattering Vision Transformer: Spectral Mixing Matters |
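The abstract describes a spectral gating network that mixes tokens and channels
with Einstein multiplication. The following is a minimal sketch of that idea in
PyTorch, not the authors' implementation: it substitutes a plain FFT along the
token axis for the paper's scattering-based low/high-frequency separation, and
the module and parameter names (SpectralGatingEinsum, token_gate, channel_gate)
are placeholders introduced here for illustration.

```python
import torch
import torch.nn as nn


class SpectralGatingEinsum(nn.Module):
    """Minimal sketch of spectral gating via Einstein multiplication.

    NOT the SVT authors' code: a plain FFT stands in for the paper's
    scattering-based frequency split; names are illustrative placeholders.
    """

    def __init__(self, num_tokens: int, channels: int):
        super().__init__()
        num_freqs = num_tokens // 2 + 1  # rfft output length along the token axis
        # Learned complex-valued gates, stored as (real, imag) pairs.
        self.token_gate = nn.Parameter(torch.randn(num_freqs, 2) * 0.02)
        self.channel_gate = nn.Parameter(torch.randn(channels, 2) * 0.02)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, channels) real-valued token embeddings.
        xf = torch.fft.rfft(x, dim=1, norm="ortho")  # complex, (B, F, C)
        # Token mixing: one learned complex gate per frequency bin (einsum).
        xf = torch.einsum("bfc,f->bfc", xf, torch.view_as_complex(self.token_gate))
        # Channel mixing: one learned complex gate per channel (einsum).
        xf = torch.einsum("bfc,c->bfc", xf, torch.view_as_complex(self.channel_gate))
        # Back to the token domain, preserving the original sequence length.
        return torch.fft.irfft(xf, n=x.size(1), dim=1, norm="ortho")


if __name__ == "__main__":
    layer = SpectralGatingEinsum(num_tokens=196, channels=64)
    tokens = torch.randn(2, 196, 64)  # e.g. 14x14 patch tokens, 64-dim
    out = layer(tokens)
    print(out.shape)  # torch.Size([2, 196, 64])
```

In the paper's design, separate gates would presumably act on the low- and
high-frequency branches produced by the scattering decomposition; the einsum
pattern above only illustrates how per-frequency and per-channel Einstein
multiplication keeps token and channel mixing at element-wise cost rather than
quadratic attention cost.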