Multi-Path Transformer is Better: A Case Study on Neural Machine Translation

For years the model performance in machine learning obeyed a power-law relationship with the model size. For the consideration of parameter efficiency, recent studies focus on increasing model depth rather than width to achieve better performance. In this paper, we study how model width affects the...

Bibliographic details
Main authors: Lin, Ye; Zhou, Shuhan; Li, Yanyang; Ma, Anxiang; Xiao, Tong; Zhu, Jingbo
Format: Article
Language: eng
creator Lin, Ye
Zhou, Shuhan
Li, Yanyang
Ma, Anxiang
Xiao, Tong
Zhu, Jingbo
description For years the model performance in machine learning obeyed a power-law relationship with the model size. For the consideration of parameter efficiency, recent studies focus on increasing model depth rather than width to achieve better performance. In this paper, we study how model width affects the Transformer model through a parameter-efficient multi-path structure. To better fuse features extracted from different paths, we add three additional operations to each sublayer: a normalization at the end of each path, a cheap operation to produce more features, and a learnable weighted mechanism to fuse all features flexibly. Extensive experiments on 12 WMT machine translation tasks show that, with the same number of parameters, the shallower multi-path model can achieve similar or even better performance than the deeper model. It reveals that we should pay more attention to the multi-path structure, and there should be a balance between the model depth and width to train a better large-scale Transformer.
doi_str_mv 10.48550/arxiv.2305.05948
format Article
fulltext fulltext_linktorsrc
identifier DOI: 10.48550/arxiv.2305.05948
language eng
recordid cdi_arxiv_primary_2305_05948
source arXiv.org
subjects Computer Science - Artificial Intelligence
Computer Science - Computation and Language
title Multi-Path Transformer is Better: A Case Study on Neural Machine Translation
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-16T20%3A30%3A28IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-arxiv_GOX&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Multi-Path%20Transformer%20is%20Better:%20A%20Case%20Study%20on%20Neural%20Machine%20Translation&rft.au=Lin,%20Ye&rft.date=2023-05-10&rft_id=info:doi/10.48550/arxiv.2305.05948&rft_dat=%3Carxiv_GOX%3E2305_05948%3C/arxiv_GOX%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_id=info:pmid/&rfr_iscdi=true
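
The description field above names three additions to each multi-path sublayer: a normalization at the end of each path, a cheap operation that produces more features, and a learnable weighting that fuses all features. The Python sketch below is one hypothetical reading of that sentence, not the authors' implementation. The function names, the pairwise-averaging "cheap operation", the residual connection, and the plain scalar fusion weights are all assumptions made for illustration; the record itself specifies none of them.

```python
import math
import random


def layer_norm(vec, eps=1e-6):
    """Normalize a vector to zero mean and (approximately) unit variance."""
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    return [(v - mean) / math.sqrt(var + eps) for v in vec]


def linear(vec, weight):
    """Apply a weight matrix given as a list of rows."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weight]


def cheap_features(path_outputs):
    """Assumed cheap operation: average each path with its cyclic neighbour.

    It adds no parameters; the real operation may differ.
    """
    n = len(path_outputs)
    return [
        [(a + b) / 2 for a, b in zip(path_outputs[i], path_outputs[(i + 1) % n])]
        for i in range(n)
    ]


def multi_path_sublayer(x, path_weights, fusion_weights):
    """Run x through parallel paths, then fuse path and cheap features."""
    # 1. Each path transforms the input and ends with a normalization.
    normed = [layer_norm(linear(x, w)) for w in path_weights]
    # 2. A cheap operation derives extra features from the path outputs.
    features = normed + cheap_features(normed)
    if len(features) != len(fusion_weights):
        raise ValueError("need one fusion weight per feature vector")
    # 3. Weighted sum of all features (weights would be learned in training).
    fused = [
        sum(a * f[i] for a, f in zip(fusion_weights, features))
        for i in range(len(x))
    ]
    # Assumed residual connection around the sublayer.
    return [xi + fi for xi, fi in zip(x, fused)]


# Usage: three paths over a 4-dimensional input, random illustrative weights.
rng = random.Random(0)
dim, num_paths = 4, 3
x = [0.5, -1.0, 2.0, 0.0]
path_weights = [
    [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(dim)]
    for _ in range(num_paths)
]
fusion_weights = [1.0 / (2 * num_paths)] * (2 * num_paths)
y = multi_path_sublayer(x, path_weights, fusion_weights)
```

In this sketch the number of fused feature vectors is twice the number of paths, so widening the sublayer by one path adds one weight matrix and two scalar fusion weights. Setting every fusion weight to zero reduces the sublayer to the identity through the assumed residual connection.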