Mobile V-MoEs: Scaling Down Vision Transformers via Sparse Mixture-of-Experts
Sparse Mixture-of-Experts models (MoEs) have recently gained popularity due to their ability to decouple model size from inference efficiency by only activating a small subset of the model parameters for any given input token. As such, sparse MoEs have enabled unprecedented scalability, resulting in...
Main Authors: | Daxberger, Erik; Weers, Floris; Zhang, Bowen; Gunter, Tom; Pang, Ruoming; Eichner, Marcin; Emmersberger, Michael; Yang, Yinfei; Toshev, Alexander; Du, Xianzhi |
---|---|
Format: | Article |
Language: | English |
Subjects: | Computer Science - Computer Vision and Pattern Recognition; Computer Science - Learning; Statistics - Machine Learning |
creator | Daxberger, Erik; Weers, Floris; Zhang, Bowen; Gunter, Tom; Pang, Ruoming; Eichner, Marcin; Emmersberger, Michael; Yang, Yinfei; Toshev, Alexander; Du, Xianzhi |
description | Sparse Mixture-of-Experts models (MoEs) have recently gained popularity due
to their ability to decouple model size from inference efficiency by only
activating a small subset of the model parameters for any given input token. As
such, sparse MoEs have enabled unprecedented scalability, resulting in
tremendous successes across domains such as natural language processing and
computer vision. In this work, we instead explore the use of sparse MoEs to
scale down Vision Transformers (ViTs) to make them more attractive for
resource-constrained vision applications. To this end, we propose a simplified
and mobile-friendly MoE design where entire images rather than individual
patches are routed to the experts. We also propose a stable MoE training
procedure that uses super-class information to guide the router. We empirically
show that our sparse Mobile Vision MoEs (V-MoEs) can achieve a better trade-off
between performance and efficiency than the corresponding dense ViTs. For
example, for the ViT-Tiny model, our Mobile V-MoE outperforms its dense
counterpart by 3.39% on ImageNet-1k. For an even smaller ViT variant with only
54M FLOPs inference cost, our MoE achieves an improvement of 4.66%. |
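The description highlights the two ideas that distinguish Mobile V-MoEs from standard token-level MoEs: routing whole images rather than individual patches to the experts, and guiding the router with super-class labels during training. Below is a minimal sketch of what such per-image routing could look like, assuming a PyTorch-style ViT MLP block; the names (`MobileVMoEBlock`, `num_experts`, `top_k`) and the cross-entropy auxiliary term are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of per-image expert routing, NOT the authors' code.
# All names and the super-class auxiliary loss are illustrative assumptions.
from typing import Optional

import torch
import torch.nn as nn
import torch.nn.functional as F


class MobileVMoEBlock(nn.Module):
    """MoE feed-forward layer that routes entire images (not patch tokens) to experts."""

    def __init__(self, dim: int, hidden_dim: int, num_experts: int = 8, top_k: int = 1):
        super().__init__()
        self.top_k = top_k
        # One routing decision per image, computed from pooled token features.
        self.router = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList(
            [
                nn.Sequential(nn.Linear(dim, hidden_dim), nn.GELU(), nn.Linear(hidden_dim, dim))
                for _ in range(num_experts)
            ]
        )

    def forward(self, x: torch.Tensor, superclass: Optional[torch.Tensor] = None):
        # x: (batch, tokens, dim); superclass: (batch,) integer super-class labels.
        pooled = x.mean(dim=1)                       # (batch, dim)
        logits = self.router(pooled)                 # (batch, num_experts)
        weights = logits.softmax(dim=-1)
        top_w, top_i = weights.topk(self.top_k, dim=-1)

        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = top_i[:, k] == e              # images assigned to expert e
                if mask.any():
                    # All tokens of a selected image pass through the same expert.
                    out[mask] += top_w[mask, k, None, None] * expert(x[mask])

        # Optional auxiliary term standing in for the paper's super-class-guided
        # router training (assumes num_experts equals the number of super-classes).
        aux_loss = F.cross_entropy(logits, superclass) if superclass is not None else None
        return out, aux_loss
```

Because the routing decision is made once per image, only the selected experts' parameters are exercised for that image, which is the efficiency property the abstract describes for resource-constrained deployment.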
format | Article |
identifier | DOI: 10.48550/arxiv.2309.04354 |
language | eng |
source | arXiv.org |
subjects | Computer Science - Computer Vision and Pattern Recognition; Computer Science - Learning; Statistics - Machine Learning |
title | Mobile V-MoEs: Scaling Down Vision Transformers via Sparse Mixture-of-Experts |
url | https://arxiv.org/abs/2309.04354 |