Beyond Distillation: Task-level Mixture-of-Experts for Efficient Inference
Sparse Mixture-of-Experts (MoE) has been a successful approach for scaling multilingual translation models to billions of parameters without a proportional increase in training computation. However, MoE models are prohibitively large and practitioners often resort to methods such as distillation for...
Main Authors: | Kudugunta, Sneha; Huang, Yanping; Bapna, Ankur; Krikun, Maxim; Lepikhin, Dmitry; Luong, Minh-Thang; Firat, Orhan |
---|---|
Format: | Article |
Language: | eng |
Subjects: | Computer Science - Computation and Language; Computer Science - Learning |
Online Access: | Order full text |
container_end_page | |
---|---|
container_issue | |
container_start_page | |
container_title | |
container_volume | |
creator | Kudugunta, Sneha; Huang, Yanping; Bapna, Ankur; Krikun, Maxim; Lepikhin, Dmitry; Luong, Minh-Thang; Firat, Orhan |
description | Sparse Mixture-of-Experts (MoE) has been a successful approach for scaling
multilingual translation models to billions of parameters without a
proportional increase in training computation. However, MoE models are
prohibitively large and practitioners often resort to methods such as
distillation for serving. In this work, we investigate routing strategies at
different granularity (token, sentence, task) in MoE models to bypass
distillation. Experiments on WMT and a web-scale dataset suggest that
task-level routing (task-MoE) enables us to extract smaller, ready-to-deploy
sub-networks from large sparse models. On WMT, our task-MoE with 32 experts
(533M parameters) outperforms the best performing token-level MoE model
(token-MoE) by +1.0 BLEU on average across 30 language pairs. The peak
inference throughput is also improved by a factor of 1.9x when we route by
tasks instead of tokens. While distilling a token-MoE to a smaller dense model
preserves only 32% of the BLEU gains, our sub-network task-MoE, by design,
preserves all the gains with the same inference cost as the distilled student
model. Finally, when scaling up to 200 language pairs, our 128-expert task-MoE
(13B parameters) performs competitively with a token-level counterpart, while
improving the peak inference throughput by a factor of 2.6x. |
doi_str_mv | 10.48550/arxiv.2110.03742 |
format | Article |
fulltext | fulltext_linktorsrc |
identifier | DOI: 10.48550/arxiv.2110.03742 |
ispartof | |
issn | |
language | eng |
recordid | cdi_arxiv_primary_2110_03742 |
source | arXiv.org |
subjects | Computer Science - Computation and Language; Computer Science - Learning |
title | Beyond Distillation: Task-level Mixture-of-Experts for Efficient Inference |
url | https://arxiv.org/abs/2110.03742 |
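
The abstract above contrasts token-level and task-level routing in a sparse Mixture-of-Experts layer. The NumPy sketch below is a minimal illustration of that difference, not the authors' implementation: the layer sizes, the gating by a learned task embedding, and the helper names (e.g. `extract_task_subnetwork`) are all assumptions chosen for clarity. It demonstrates the property the abstract relies on: when the routing decision depends only on the task, the experts needed at serving time are fixed in advance, so a small sub-network can be extracted and deployed without distillation.

```python
# Minimal, illustrative sketch (assumptions throughout; not the paper's code) of
# token-level vs. task-level routing in a top-1 Mixture-of-Experts layer.
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff, num_experts, num_tasks = 16, 32, 4, 8

# Each "expert" is a tiny feed-forward block: (W_in, W_out).
experts = [(rng.standard_normal((d_model, d_ff)) * 0.02,
            rng.standard_normal((d_ff, d_model)) * 0.02)
           for _ in range(num_experts)]
router_w = rng.standard_normal((d_model, num_experts)) * 0.02   # shared gating weights
task_emb = rng.standard_normal((num_tasks, d_model)) * 0.02     # one embedding per task (e.g. language pair)

def expert_forward(x, expert):
    w_in, w_out = expert
    return np.maximum(x @ w_in, 0.0) @ w_out                    # ReLU feed-forward block

def token_moe(tokens):
    """Token-MoE: every token picks its own top-1 expert, so all experts must be loaded."""
    choice = (tokens @ router_w).argmax(axis=-1)                # (seq_len,) expert id per token
    out = np.stack([expert_forward(tokens[i], experts[e]) for i, e in enumerate(choice)])
    return out, choice

def task_moe(tokens, task_id):
    """Task-MoE: the gate sees only the task embedding, so one expert serves the whole task."""
    e = int((task_emb[task_id] @ router_w).argmax())
    return expert_forward(tokens, experts[e]), e

def extract_task_subnetwork(task_id):
    """Routing is static per task, so the deployable sub-network is just that task's expert."""
    e = int((task_emb[task_id] @ router_w).argmax())
    return experts[e]

tokens = rng.standard_normal((5, d_model))
_, per_token_ids = token_moe(tokens)
_, per_task_id = task_moe(tokens, task_id=3)
print("token-MoE expert ids:", per_token_ids)   # can differ token by token
print("task-MoE expert id:  ", per_task_id)     # fixed for the task -> extractable sub-network
```

Running the sketch prints a potentially different expert id for each token under token-level routing, but a single fixed id under task-level routing, which is why only that expert's weights need to ship with the serving model.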