Multi-Path Transformer is Better: A Case Study on Neural Machine Translation


Full description

Saved in:
Bibliographic details
Published in: arXiv.org 2023-05
Main authors: Lin, Ye, Zhou, Shuhan, Li, Yanyang, Ma, Anxiang, Tong, Xiao, Zhu, Jingbo
Format: Article
Language: eng
Subjects:
Online access: Full text
container_end_page
container_issue
container_start_page
container_title arXiv.org
container_volume
creator Lin, Ye
Zhou, Shuhan
Li, Yanyang
Ma, Anxiang
Tong, Xiao
Zhu, Jingbo
description For years, model performance in machine learning has followed a power-law relationship with model size. In pursuit of parameter efficiency, recent studies have focused on increasing model depth rather than width to achieve better performance. In this paper, we study how model width affects the Transformer through a parameter-efficient multi-path structure. To better fuse the features extracted from different paths, we add three operations to each sublayer: a normalization at the end of each path, a cheap operation to produce more features, and a learnable weighted mechanism to fuse all features flexibly. Extensive experiments on 12 WMT machine translation tasks show that, with the same number of parameters, a shallower multi-path model can match or even outperform a deeper model. This suggests that the multi-path structure deserves more attention, and that model depth and width should be balanced to train a better large-scale Transformer.
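The abstract describes the three added operations only in prose; the sketch below illustrates one possible reading of that sublayer design in PyTorch. The number of paths, the use of a shared linear map as the "cheap operation", and all names (MultiPathSublayer, fuse_weights, etc.) are assumptions made for illustration, not the authors' released implementation.

```python
import torch
import torch.nn as nn


class MultiPathSublayer(nn.Module):
    """Toy multi-path sublayer: parallel paths, per-path LayerNorm,
    a cheap feature-expanding operation, and learnable weighted fusion."""

    def __init__(self, d_model: int, num_paths: int = 2, dropout: float = 0.1):
        super().__init__()
        # One small feed-forward block per path, each ending in its own
        # LayerNorm ("a normalization at the end of each path").
        self.paths = nn.ModuleList(
            [
                nn.Sequential(
                    nn.Linear(d_model, d_model),
                    nn.ReLU(),
                    nn.Linear(d_model, d_model),
                    nn.LayerNorm(d_model),
                )
                for _ in range(num_paths)
            ]
        )
        # "Cheap operation to produce more features": here a single shared
        # linear map applied to every path output (an assumption).
        self.cheap = nn.Linear(d_model, d_model)
        # Learnable weights that fuse the 2 * num_paths feature sets.
        self.fuse_weights = nn.Parameter(torch.zeros(2 * num_paths))
        self.dropout = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        feats = []
        for path in self.paths:
            h = path(x)
            feats.append(h)              # original path features
            feats.append(self.cheap(h))  # extra "cheap" features
        stacked = torch.stack(feats, dim=0)          # (P, batch, len, d_model)
        w = torch.softmax(self.fuse_weights, dim=0)  # flexible fusion weights
        fused = (w.view(-1, 1, 1, 1) * stacked).sum(dim=0)
        return x + self.dropout(fused)               # residual connection


# Usage: fuse a toy batch of token representations.
layer = MultiPathSublayer(d_model=64, num_paths=2)
out = layer(torch.randn(8, 10, 64))  # (batch, seq_len, d_model)
print(out.shape)                     # torch.Size([8, 10, 64])
```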
format Article
fulltext fulltext
identifier EISSN: 2331-8422
ispartof arXiv.org, 2023-05
issn 2331-8422
language eng
recordid cdi_proquest_journals_2812249231
source Free E-Journals
subjects Machine learning
Machine translation
Mathematical models
Parameters
Transformers
title Multi-Path Transformer is Better: A Case Study on Neural Machine Translation
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2024-12-30T20%3A40%3A00IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest&rft_val_fmt=info:ofi/fmt:kev:mtx:book&rft.genre=document&rft.atitle=Multi-Path%20Transformer%20is%20Better:%20A%20Case%20Study%20on%20Neural%20Machine%20Translation&rft.jtitle=arXiv.org&rft.au=Lin,%20Ye&rft.date=2023-05-10&rft.eissn=2331-8422&rft_id=info:doi/&rft_dat=%3Cproquest%3E2812249231%3C/proquest%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2812249231&rft_id=info:pmid/&rfr_iscdi=true