diffGrad: An Optimization Method for Convolutional Neural Networks

Stochastic gradient descent (SGD) is one of the core techniques behind the success of deep neural networks. The gradient provides information on the direction in which a function has the steepest rate of change. The main problem with basic SGD is that it updates all parameters with equal-sized steps, irrespective of the gradient behavior. Hence, an efficient way of deep network optimization is to have adaptive step sizes for each parameter. Recently, several attempts have been made to improve gradient descent methods, such as AdaGrad, AdaDelta, RMSProp, and adaptive moment estimation (Adam). These methods rely on the square roots of exponential moving averages of squared past gradients and therefore do not take advantage of local changes in gradients. In this article, a novel optimizer is proposed based on the difference between the present and the immediate past gradient (i.e., diffGrad). In the proposed diffGrad optimization technique, the step size is adjusted for each parameter in such a way that parameters with rapidly changing gradients receive a larger step size and parameters with slowly changing gradients receive a smaller step size. The convergence analysis is carried out using the regret-bound approach of the online learning framework. A thorough analysis is made over three synthetic complex nonconvex functions. Image categorization experiments are also conducted on the CIFAR10 and CIFAR100 data sets to compare diffGrad against state-of-the-art optimizers such as SGDM, AdaGrad, AdaDelta, RMSProp, AMSGrad, and Adam. A residual-unit (ResNet)-based convolutional neural network (CNN) architecture is used in the experiments. The experiments show that diffGrad outperforms the other optimizers. We also show that diffGrad performs uniformly well when training CNNs with different activation functions. The source code is publicly available at https://github.com/shivram1987/diffGrad .
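
To make the update described above concrete, here is a minimal NumPy sketch of a diffGrad-style step. It is not the authors' reference implementation (that is the PyTorch code linked above); it assumes an Adam-style base in which the bias-corrected first moment is scaled by a friction term, taken here to be a sigmoid of the absolute difference between the immediate past and the present gradient. The function name diffgrad_sketch, its hyperparameter defaults, and the toy usage at the end are illustrative.

```python
import numpy as np

def diffgrad_sketch(grad_fn, theta0, lr=1e-3, beta1=0.9, beta2=0.999,
                    eps=1e-8, num_steps=1000):
    """Minimal sketch of a diffGrad-style update (illustrative, not the
    authors' reference implementation).

    The step is Adam-like, but the bias-corrected first moment is scaled
    by a friction term xi = sigmoid(|g_prev - g|), so parameters whose
    gradient changes quickly keep a larger effective step size.
    """
    theta = np.asarray(theta0, dtype=float).copy()
    m = np.zeros_like(theta)       # first moment (moving average of gradients)
    v = np.zeros_like(theta)       # second moment (moving average of squared gradients)
    g_prev = np.zeros_like(theta)  # gradient from the previous step

    for t in range(1, num_steps + 1):
        g = grad_fn(theta)

        # Adam-style exponential moving averages with bias correction.
        m = beta1 * m + (1.0 - beta1) * g
        v = beta2 * v + (1.0 - beta2) * g ** 2
        m_hat = m / (1.0 - beta1 ** t)
        v_hat = v / (1.0 - beta2 ** t)

        # Friction coefficient: sigmoid of the absolute difference between
        # the immediate past gradient and the present gradient.
        xi = 1.0 / (1.0 + np.exp(-np.abs(g_prev - g)))

        theta -= lr * xi * m_hat / (np.sqrt(v_hat) + eps)
        g_prev = g

    return theta

# Toy usage: minimize f(x) = (x - 3)^2, whose gradient is 2 * (x - 3).
if __name__ == "__main__":
    result = diffgrad_sketch(lambda x: 2.0 * (x - 3.0), theta0=[0.0],
                             lr=0.1, num_steps=2000)
    print(result)  # approaches [3.0]
```

Because the sigmoid of a non-negative quantity lies between 0.5 and 1, the scaled step never exceeds the corresponding Adam step and shrinks where the gradient is locally flat, matching the dampening behavior the abstract attributes to slowly changing parameters.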

Bibliographic Details
Published in: IEEE Transactions on Neural Networks and Learning Systems, 2020-11, Vol. 31 (11), p. 4500-4511
Main authors: Dubey, Shiv Ram; Chakraborty, Soumendu; Roy, Swalpa Kumar; Mukherjee, Snehasis; Singh, Satish Kumar; Chaudhuri, Bidyut Baran
Format: Article
Language: English
Online access: full text via DOI 10.1109/TNNLS.2019.2955777
DOI: 10.1109/TNNLS.2019.2955777
ISSN: 2162-237X
EISSN: 2162-2388
PMID: 31880565
Record ID: cdi_proquest_miscellaneous_2331251733
Source: IEEE Electronic Library (IEL)
Subjects:
Adaptive moment estimation (Adam)
Artificial neural networks
Computer vision
Convergence
difference of gradient
Estimation
Experiments
gradient descent
image classification
Machine learning
Neural networks
Optimization
Optimization methods
Optimization techniques
Parameters
residual network
Source code
Stochasticity
Training