Mixture-of-Supernets: Improving Weight-Sharing Supernet Training with Architecture-Routed Mixture-of-Experts

Weight-sharing supernets are crucial for performance estimation in cutting-edge neural architecture search (NAS) frameworks. Despite their ability to generate diverse subnetworks without retraining, the quality of these subnetworks is not guaranteed due to weight sharing. In NLP tasks such as machine translation and pre-trained language modeling, there is a significant performance gap between the supernet and training from scratch for the same model architecture, necessitating retraining once the optimal architecture has been identified. This study introduces mixture-of-supernets, a generalized supernet formulation that leverages mixture-of-experts (MoE) to enhance supernet expressiveness with minimal training overhead. Unlike conventional supernets, this method employs an architecture-based routing mechanism, enabling indirect sharing of model weights among subnetworks. The resulting architecture-specific weights, learned through gradient descent, minimize retraining time and significantly improve training efficiency in NLP. The proposed method attains state-of-the-art (SoTA) performance in NAS for fast machine translation models, exhibiting a superior latency-BLEU tradeoff compared to HAT, the SoTA NAS framework for machine translation. It also excels in NAS for building memory-efficient, task-agnostic BERT models, surpassing NAS-BERT and AutoDistil across various model sizes. The code can be found at: https://github.com/UBC-NLP/MoS.
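The abstract describes the architecture-routed mixture-of-experts idea only at a high level. Below is a minimal, hypothetical PyTorch sketch of that idea, not the authors' implementation (see the linked repository for that): a small router maps an encoding of the sampled sub-architecture to mixing coefficients over a set of shared expert weight matrices, and the blended weights are then sliced to the sub-network's dimensions as in ordinary weight sharing. The class name ArchRoutedLinear, the architecture encoding, and all sizes are illustrative assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ArchRoutedLinear(nn.Module):
    """Hypothetical sketch of an architecture-routed MoE linear layer (not the paper's code)."""
    def __init__(self, max_in, max_out, num_experts=2, arch_dim=8):
        super().__init__()
        # Expert weight matrices, each sized for the largest sub-network.
        self.experts = nn.Parameter(torch.randn(num_experts, max_out, max_in) * 0.02)
        self.bias = nn.Parameter(torch.zeros(max_out))
        # Router: architecture encoding -> mixture coefficients over the experts.
        self.router = nn.Sequential(
            nn.Linear(arch_dim, 32), nn.ReLU(), nn.Linear(32, num_experts)
        )

    def forward(self, x, arch_enc, in_dim, out_dim):
        # arch_enc: 1-D float tensor describing the sampled sub-architecture.
        alpha = F.softmax(self.router(arch_enc), dim=-1)      # (num_experts,)
        w = torch.einsum("e,eoi->oi", alpha, self.experts)    # architecture-specific blend
        # Slice to the sampled sub-network's dimensions (standard weight sharing).
        return F.linear(x, w[:out_dim, :in_dim], self.bias[:out_dim])

# Usage: route a 256-in / 384-out sub-network through the shared layer.
layer = ArchRoutedLinear(max_in=512, max_out=512)
x = torch.randn(4, 256)
arch_enc = torch.tensor([256.0, 384.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0]) / 512.0
y = layer(x, arch_enc, in_dim=256, out_dim=384)    # shape (4, 384)

Because the mixing coefficients depend only on the architecture encoding, gradient descent can specialize the blended weights for each sampled sub-network while the underlying expert matrices stay shared across all of them, which is the indirect weight sharing the abstract refers to.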


Bibliographic Details
Main Authors: Jawahar, Ganesh; Yang, Haichuan; Xiong, Yunyang; Liu, Zechun; Wang, Dilin; Sun, Fei; Li, Meng; Pappu, Aasish; Oguz, Barlas; Abdul-Mageed, Muhammad; Lakshmanan, Laks V. S; Krishnamoorthi, Raghuraman; Chandra, Vikas
Format: Article
Language: English
Subjects: Computer Science - Computation and Language
DOI: 10.48550/arxiv.2306.04845
Date: 2023-06-07
Source: arXiv.org
Online Access: Full text at https://arxiv.org/abs/2306.04845