diffGrad: An Optimization Method for Convolutional Neural Networks

Stochastic gradient descent (SGD) is one of the core techniques behind the success of deep neural networks. The gradient provides information on the direction in which a function has the steepest rate of change. The main problem with basic SGD is that it updates all parameters with equal-sized steps, irrespective of the gradient behavior. Hence, an efficient way of deep network optimization is to have adaptive step sizes for each parameter. Recently, several attempts have been made to improve gradient descent methods, such as AdaGrad, AdaDelta, RMSProp, and adaptive moment estimation (Adam). These methods rely on the square roots of exponential moving averages of squared past gradients and therefore do not take advantage of local changes in gradients. In this article, a novel optimizer is proposed based on the difference between the present and the immediate past gradient (i.e., diffGrad). In the proposed diffGrad optimization technique, the step size is adjusted for each parameter in such a way that parameters with rapidly changing gradients receive a larger step size and parameters with slowly changing gradients receive a smaller step size. The convergence analysis is carried out using the regret-bound approach of the online learning framework. A thorough analysis is made over three synthetic complex nonconvex functions. Image categorization experiments are also conducted on the CIFAR10 and CIFAR100 data sets to compare diffGrad against state-of-the-art optimizers such as SGDM, AdaGrad, AdaDelta, RMSProp, AMSGrad, and Adam. A residual-unit (ResNet)-based convolutional neural network (CNN) architecture is used in the experiments. The experiments show that diffGrad outperforms the other optimizers. We also show that diffGrad performs uniformly well when training CNNs with different activation functions. The source code is publicly available at https://github.com/shivram1987/diffGrad .
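
To make the update described above concrete, here is a minimal NumPy sketch of a diffGrad-style step. It is not the authors' reference implementation (that is the PyTorch code linked above); it assumes an Adam-style base in which the bias-corrected first moment is scaled by a friction term, taken here to be a sigmoid of the absolute difference between the immediate past and the present gradient. The function name diffgrad_sketch, its hyperparameter defaults, and the toy usage at the end are illustrative.

```python
import numpy as np

def diffgrad_sketch(grad_fn, theta0, lr=1e-3, beta1=0.9, beta2=0.999,
                    eps=1e-8, num_steps=1000):
    """Minimal sketch of a diffGrad-style update (illustrative, not the
    authors' reference implementation).

    The step is Adam-like, but the bias-corrected first moment is scaled
    by a friction term xi = sigmoid(|g_prev - g|), so parameters whose
    gradient changes quickly keep a larger effective step size.
    """
    theta = np.asarray(theta0, dtype=float).copy()
    m = np.zeros_like(theta)       # first moment (moving average of gradients)
    v = np.zeros_like(theta)       # second moment (moving average of squared gradients)
    g_prev = np.zeros_like(theta)  # gradient from the previous step

    for t in range(1, num_steps + 1):
        g = grad_fn(theta)

        # Adam-style exponential moving averages with bias correction.
        m = beta1 * m + (1.0 - beta1) * g
        v = beta2 * v + (1.0 - beta2) * g ** 2
        m_hat = m / (1.0 - beta1 ** t)
        v_hat = v / (1.0 - beta2 ** t)

        # Friction coefficient: sigmoid of the absolute difference between
        # the immediate past gradient and the present gradient.
        xi = 1.0 / (1.0 + np.exp(-np.abs(g_prev - g)))

        theta -= lr * xi * m_hat / (np.sqrt(v_hat) + eps)
        g_prev = g

    return theta

# Toy usage: minimize f(x) = (x - 3)^2, whose gradient is 2 * (x - 3).
if __name__ == "__main__":
    result = diffgrad_sketch(lambda x: 2.0 * (x - 3.0), theta0=[0.0],
                             lr=0.1, num_steps=2000)
    print(result)  # approaches [3.0]
```

Because the sigmoid of a non-negative quantity lies between 0.5 and 1, the scaled step never exceeds the corresponding Adam step and shrinks where the gradient is locally flat, matching the dampening behavior the abstract attributes to slowly changing parameters.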

Bibliographic Details
Published in: IEEE Transactions on Neural Networks and Learning Systems, 2020-11, Vol. 31 (11), p. 4500-4511
Main authors: Dubey, Shiv Ram; Chakraborty, Soumendu; Roy, Swalpa Kumar; Mukherjee, Snehasis; Singh, Satish Kumar; Chaudhuri, Bidyut Baran
Format: Article
Language: English
Online access: full text via DOI 10.1109/TNNLS.2019.2955777
DOI: 10.1109/TNNLS.2019.2955777
ISSN: 2162-237X
EISSN: 2162-2388
PMID: 31880565
Record ID: cdi_proquest_miscellaneous_2331251733
Source: IEEE Electronic Library (IEL)
Subjects:
Adaptive moment estimation (Adam)
Artificial neural networks
Computer vision
Convergence
difference of gradient
Estimation
Experiments
gradient descent
image classification
Machine learning
Neural networks
Optimization
Optimization methods
Optimization techniques
Parameters
residual network
Source code
Stochasticity
Training