Mixture-of-Supernets: Improving Weight-Sharing Supernet Training with Architecture-Routed Mixture-of-Experts

Weight-sharing supernets are crucial for performance estimation in cutting-edge neural architecture search (NAS) frameworks. Despite their ability to generate diverse subnetworks without retraining, the quality of these subnetworks is not guaranteed due to weight sharing. In NLP tasks such as machine translation and pre-trained language modeling, there is a significant performance gap between the supernet and training from scratch for the same model architecture, necessitating retraining once the optimal architecture has been identified. This study introduces mixture-of-supernets, a generalized supernet formulation that leverages mixture-of-experts (MoE) to enhance supernet expressiveness with minimal training overhead. Unlike conventional supernets, this method employs an architecture-based routing mechanism, enabling indirect sharing of model weights among subnetworks. The resulting architecture-specific weights, learned through gradient descent, minimize retraining time and significantly improve training efficiency in NLP. The proposed method attains state-of-the-art (SoTA) performance in NAS for fast machine translation models, exhibiting a superior latency-BLEU tradeoff compared to HAT, the SoTA NAS framework for machine translation. It also excels in NAS for building memory-efficient, task-agnostic BERT models, surpassing NAS-BERT and AutoDistil across various model sizes. The code can be found at: https://github.com/UBC-NLP/MoS.
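The abstract describes the architecture-routed mixture-of-experts idea only at a high level. Below is a minimal, hypothetical PyTorch sketch of that idea, not the authors' implementation (see the linked repository for that): a small router maps an encoding of the sampled sub-architecture to mixing coefficients over a set of shared expert weight matrices, and the blended weights are then sliced to the sub-network's dimensions as in ordinary weight sharing. The class name ArchRoutedLinear, the architecture encoding, and all sizes are illustrative assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ArchRoutedLinear(nn.Module):
    """Hypothetical sketch of an architecture-routed MoE linear layer (not the paper's code)."""
    def __init__(self, max_in, max_out, num_experts=2, arch_dim=8):
        super().__init__()
        # Expert weight matrices, each sized for the largest sub-network.
        self.experts = nn.Parameter(torch.randn(num_experts, max_out, max_in) * 0.02)
        self.bias = nn.Parameter(torch.zeros(max_out))
        # Router: architecture encoding -> mixture coefficients over the experts.
        self.router = nn.Sequential(
            nn.Linear(arch_dim, 32), nn.ReLU(), nn.Linear(32, num_experts)
        )

    def forward(self, x, arch_enc, in_dim, out_dim):
        # arch_enc: 1-D float tensor describing the sampled sub-architecture.
        alpha = F.softmax(self.router(arch_enc), dim=-1)      # (num_experts,)
        w = torch.einsum("e,eoi->oi", alpha, self.experts)    # architecture-specific blend
        # Slice to the sampled sub-network's dimensions (standard weight sharing).
        return F.linear(x, w[:out_dim, :in_dim], self.bias[:out_dim])

# Usage: route a 256-in / 384-out sub-network through the shared layer.
layer = ArchRoutedLinear(max_in=512, max_out=512)
x = torch.randn(4, 256)
arch_enc = torch.tensor([256.0, 384.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0]) / 512.0
y = layer(x, arch_enc, in_dim=256, out_dim=384)    # shape (4, 384)

Because the mixing coefficients depend only on the architecture encoding, gradient descent can specialize the blended weights for each sampled sub-network while the underlying expert matrices stay shared across all of them, which is the indirect weight sharing the abstract refers to.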


Bibliographic Details
Main Authors: Jawahar, Ganesh; Yang, Haichuan; Xiong, Yunyang; Liu, Zechun; Wang, Dilin; Sun, Fei; Li, Meng; Pappu, Aasish; Oguz, Barlas; Abdul-Mageed, Muhammad; Lakshmanan, Laks V. S; Krishnamoorthi, Raghuraman; Chandra, Vikas
Format: Article
Language: English
Subjects: Computer Science - Computation and Language
DOI: 10.48550/arxiv.2306.04845
Date: 2023-06-07
Source: arXiv.org
Online Access: Full text at https://arxiv.org/abs/2306.04845