diffGrad: An Optimization Method for Convolutional Neural Networks
Stochastic gradient descent (SGD) is one of the core techniques behind the success of deep neural networks. The gradient provides information on the direction in which a function has the steepest rate of change. The main problem with basic SGD is to change by equal-sized steps for all parameters, ir...
Published in: | IEEE Transactions on Neural Networks and Learning Systems 2020-11, Vol.31 (11), p.4500-4511 |
---|---|
Main authors: | Dubey, Shiv Ram ; Chakraborty, Soumendu ; Roy, Swalpa Kumar ; Mukherjee, Snehasis ; Singh, Satish Kumar ; Chaudhuri, Bidyut Baran |
Format: | Article |
Language: | eng |
Subjects: | |
Online access: | Order full text |
container_end_page | 4511 |
---|---|
container_issue | 11 |
container_start_page | 4500 |
container_title | IEEE Transactions on Neural Networks and Learning Systems |
container_volume | 31 |
creator | Dubey, Shiv Ram ; Chakraborty, Soumendu ; Roy, Swalpa Kumar ; Mukherjee, Snehasis ; Singh, Satish Kumar ; Chaudhuri, Bidyut Baran |
description | Stochastic gradient descent (SGD) is one of the core techniques behind the success of deep neural networks. The gradient provides information on the direction in which a function has the steepest rate of change. The main problem with basic SGD is to change by equal-sized steps for all parameters, irrespective of the gradient behavior. Hence, an efficient way of deep network optimization is to have adaptive step sizes for each parameter. Recently, several attempts have been made to improve gradient descent methods such as AdaGrad, AdaDelta, RMSProp, and adaptive moment estimation (Adam). These methods rely on the square roots of exponential moving averages of squared past gradients. Thus, these methods do not take advantage of local change in gradients. In this article, a novel optimizer is proposed based on the difference between the present and the immediate past gradient (i.e., diffGrad). In the proposed diffGrad optimization technique, the step size is adjusted for each parameter in such a way that it should have a larger step size for faster gradient changing parameters and a lower step size for lower gradient changing parameters. The convergence analysis is done using the regret bound approach of the online learning framework. In this article, thorough analysis is made over three synthetic complex nonconvex functions. The image categorization experiments are also conducted over the CIFAR10 and CIFAR100 data sets to observe the performance of diffGrad with respect to the state-of-the-art optimizers such as SGDM, AdaGrad, AdaDelta, RMSProp, AMSGrad, and Adam. The residual unit (ResNet)-based convolutional neural network (CNN) architecture is used in the experiments. The experiments show that diffGrad outperforms other optimizers. Also, we show that diffGrad performs uniformly well for training CNN using different activation functions. The source code is made publicly available at https://github.com/shivram1987/diffGrad . |
doi_str_mv | 10.1109/TNNLS.2019.2955777 |
format | Article |
addtitle | TNNLS ; IEEE Trans Neural Netw Learn Syst |
coden | ITNNAL |
date | 2020-11-01 |
ieee_id | 8939562 |
linktohtml | https://ieeexplore.ieee.org/document/8939562 |
backlink | https://www.ncbi.nlm.nih.gov/pubmed/31880565 |
orcid | 0000-0002-4532-8996 ; 0000-0002-8778-8229 ; 0000-0002-6580-3977 ; 0000-0003-0297-8929 ; 0000-0002-8536-4991 |
pmid | 31880565 |
publisher | United States: IEEE |
rights | Copyright The Institute of Electrical and Electronics Engineers, Inc. (IEEE) 2020 |
tpages | 12 |
fulltext | fulltext_linktorsrc |
identifier | ISSN: 2162-237X |
ispartof | IEEE Transactions on Neural Networks and Learning Systems, 2020-11, Vol.31 (11), p.4500-4511 |
issn | 2162-237X ; 2162-2388 |
language | eng |
recordid | cdi_proquest_miscellaneous_2331251733 |
source | IEEE Electronic Library (IEL) |
subjects | Adaptive moment estimation (Adam) ; Artificial neural networks ; Computer vision ; Convergence ; difference of gradient ; Estimation ; Experiments ; gradient descent ; image classification ; Machine learning ; Neural networks ; Optimization ; Optimization methods ; Optimization techniques ; Parameters ; residual network ; Source code ; Stochasticity ; Training |
title | diffGrad: An Optimization Method for Convolutional Neural Networks |
url | https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-03T22%3A55%3A47IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_RIE&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=diffGrad:%20An%20Optimization%20Method%20for%20Convolutional%20Neural%20Networks&rft.jtitle=IEEE%20transaction%20on%20neural%20networks%20and%20learning%20systems&rft.au=Dubey,%20Shiv%20Ram&rft.date=2020-11-01&rft.volume=31&rft.issue=11&rft.spage=4500&rft.epage=4511&rft.pages=4500-4511&rft.issn=2162-237X&rft.eissn=2162-2388&rft.coden=ITNNAL&rft_id=info:doi/10.1109/TNNLS.2019.2955777&rft_dat=%3Cproquest_RIE%3E2331251733%3C/proquest_RIE%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2458749998&rft_id=info:pmid/31880565&rft_ieee_id=8939562&rfr_iscdi=true |
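
The description field says that diffGrad scales each parameter's step by how quickly its gradient is changing, using the difference between the present and the immediately preceding gradient. The record itself gives no formula, so the following is only a minimal NumPy sketch under stated assumptions: it assumes an Adam-style update whose step is multiplied by a "friction" factor equal to the sigmoid of the absolute gradient difference. The names `diffgrad_step`, `init_state`, and `minimize_quadratic`, as well as the default hyperparameters, are hypothetical and chosen for illustration; the authors' repository linked in the description is the reference implementation.

```python
import numpy as np


def init_state(theta):
    """Create zeroed optimizer state for a parameter array (hypothetical helper)."""
    zeros = np.zeros_like(theta)
    return {"m": zeros.copy(), "v": zeros.copy(), "prev_grad": zeros.copy(), "t": 0}


def diffgrad_step(theta, grad, state, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One diffGrad-style update (sketch, not the authors' code).

    Assumption: the step is an Adam step scaled per parameter by
    xi = sigmoid(|previous gradient - current gradient|).
    """
    state["t"] += 1
    t = state["t"]
    # Exponential moving averages of the gradient and the squared gradient.
    state["m"] = beta1 * state["m"] + (1 - beta1) * grad
    state["v"] = beta2 * state["v"] + (1 - beta2) * grad ** 2
    # Bias correction for the zero-initialised averages.
    m_hat = state["m"] / (1 - beta1 ** t)
    v_hat = state["v"] / (1 - beta2 ** t)
    # Friction factor in [0.5, 1): close to 1 where the gradient changes
    # quickly, close to 0.5 where it barely changes.
    xi = 1.0 / (1.0 + np.exp(-np.abs(state["prev_grad"] - grad)))
    state["prev_grad"] = grad.copy()
    return theta - lr * xi * m_hat / (np.sqrt(v_hat) + eps)


def minimize_quadratic(steps=2000, lr=0.01):
    """Toy usage: minimise f(x) = sum(x**2), whose gradient is 2 * x."""
    theta = np.array([1.0, -2.0])
    state = init_state(theta)
    for _ in range(steps):
        theta = diffgrad_step(theta, 2.0 * theta, state, lr=lr)
    return theta
```

Under this assumption the factor never drops below 0.5, so a parameter whose gradient has stopped changing still moves, only at roughly half the Adam step; parameters with rapidly changing gradients take close to the full step, which matches the behaviour the description outlines.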