Stochastic Gradient Methods with Layer-wise Adaptive Moments for Training of Deep Networks
We propose NovoGrad, an adaptive stochastic gradient descent method with layer-wise gradient normalization and decoupled weight decay. In our experiments on neural networks for image classification, speech recognition, machine translation, and language modeling, it performs on par or better than well-tuned SGD with momentum and Adam or AdamW. Additionally, NovoGrad (1) is robust to the choice of learning rate and weight initialization, (2) works well in a large-batch setting, and (3) has a two times smaller memory footprint than Adam.
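The abstract already pins down the structure of the update: the second moment is tracked per layer rather than per weight (which is where the smaller-than-Adam memory footprint comes from), each layer's gradient is normalized by that layer-wise moment, and weight decay is applied to the update directly rather than through the loss. The NumPy sketch below only illustrates that structure; the function name, state layout, and hyperparameter defaults are illustrative assumptions, not the authors' reference implementation.

```python
import numpy as np

def novograd_step(params, grads, m, v, lr=0.01, beta1=0.95, beta2=0.98,
                  eps=1e-8, weight_decay=0.001):
    """One NovoGrad-style update over per-layer parameters (illustrative sketch).

    params, grads, m : lists of np.ndarray, one entry per layer
    v                : list with one scalar (or None before the first step) per layer
    """
    for l, (w, g) in enumerate(zip(params, grads)):
        g_norm_sq = float(np.sum(g * g))          # layer-wise squared gradient norm
        if v[l] is None:                          # first step: initialize the moment
            v[l] = g_norm_sq
        else:
            v[l] = beta2 * v[l] + (1.0 - beta2) * g_norm_sq
        # Normalize the gradient by the layer-wise second moment, then add
        # decoupled weight decay before folding the result into the momentum.
        step = g / (np.sqrt(v[l]) + eps) + weight_decay * w
        m[l] = beta1 * m[l] + step
        params[l] = w - lr * m[l]
    return params, m, v
```

Because `v` holds a single scalar per layer, the optimizer state is one momentum tensor plus one scalar per layer, versus Adam's two full tensors, which is consistent with the memory claim in the abstract.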
Saved in:
Main Authors: | Ginsburg, Boris; Castonguay, Patrice; Hrinchuk, Oleksii; Kuchaiev, Oleksii; Lavrukhin, Vitaly; Leary, Ryan; Li, Jason; Nguyen, Huyen; Zhang, Yang; Cohen, Jonathan M |
---|---|
Format: | Article |
Language: | eng |
Subjects: | Computer Science - Learning; Statistics - Machine Learning |
Online Access: | Order full text |
creator | Ginsburg, Boris; Castonguay, Patrice; Hrinchuk, Oleksii; Kuchaiev, Oleksii; Lavrukhin, Vitaly; Leary, Ryan; Li, Jason; Nguyen, Huyen; Zhang, Yang; Cohen, Jonathan M |
description | We propose NovoGrad, an adaptive stochastic gradient descent method with layer-wise gradient normalization and decoupled weight decay. In our experiments on neural networks for image classification, speech recognition, machine translation, and language modeling, it performs on par or better than well-tuned SGD with momentum and Adam or AdamW. Additionally, NovoGrad (1) is robust to the choice of learning rate and weight initialization, (2) works well in a large-batch setting, and (3) has a two times smaller memory footprint than Adam. |
doi_str_mv | 10.48550/arxiv.1905.11286 |
format | Article |
fulltext | fulltext_linktorsrc |
identifier | DOI: 10.48550/arxiv.1905.11286 |
language | eng |
recordid | cdi_arxiv_primary_1905_11286 |
source | arXiv.org |
subjects | Computer Science - Learning; Statistics - Machine Learning |
title | Stochastic Gradient Methods with Layer-wise Adaptive Moments for Training of Deep Networks |
url | https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-13T07%3A35%3A20IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-arxiv_GOX&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Stochastic%20Gradient%20Methods%20with%20Layer-wise%20Adaptive%20Moments%20for%20Training%20of%20Deep%20Networks&rft.au=Ginsburg,%20Boris&rft.date=2019-05-27&rft_id=info:doi/10.48550/arxiv.1905.11286&rft_dat=%3Carxiv_GOX%3E1905_11286%3C/arxiv_GOX%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_id=info:pmid/&rfr_iscdi=true |