Compressing Gradients by Exploiting Temporal Correlation in Momentum-SGD


Detailed description

Saved in:
Bibliographic details
Published in: IEEE journal on selected areas in information theory 2021-09, Vol.2 (3), p.970-986
Main authors: Adikari, Tharindu B., Draper, Stark C.
Format: Article
Language: eng
Subjects:
Online access: Order full text
container_end_page 986
container_issue 3
container_start_page 970
container_title IEEE journal on selected areas in information theory
container_volume 2
creator Adikari, Tharindu B.
Draper, Stark C.
description An increasing bottleneck in decentralized optimization is communication. Bigger models and growing datasets mean that decentralization of computation is important and that the amount of information exchanged is quickly growing. While compression techniques have been introduced to cope with the latter, none has considered leveraging the temporal correlations that exist in consecutive vector updates. An important example is distributed momentum-SGD where temporal correlation is enhanced by the low-pass-filtering effect of applying momentum. In this paper we design and analyze compression methods that exploit temporal correlation in systems both with and without error-feedback. Experiments with the ImageNet dataset demonstrate that our proposed methods offer significant reduction in the rate of communication at only a negligible increase in computation complexity. We further analyze the convergence of SGD when compression is applied with error-feedback. In the literature, convergence guarantees are developed only for compressors that provide error-bounds point-wise, i.e., for each input to the compressor. In contrast, many important codes (e.g., rate-distortion codes) provide error-bounds only in expectation and thus provide a more general guarantee. In this paper we prove the convergence of SGD under an expected error assumption by establishing a bound for the minimum gradient norm.
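The abstract describes compressing the updates of distributed momentum-SGD, with an error-feedback loop so that whatever the compressor discards is carried into the next step rather than lost. As an illustrative sketch only (the paper's actual compressor and its temporal-prediction scheme are not given in this record), the following shows one worker's step using a generic top-k compressor; the function names, parameter values, and the choice of top-k are all hypothetical.

```python
import numpy as np

def top_k(v, k):
    """Keep the k largest-magnitude entries of v, zero the rest."""
    out = np.zeros_like(v)
    idx = np.argsort(np.abs(v))[-k:]
    out[idx] = v[idx]
    return out

def step_with_error_feedback(grad, momentum, residual, lr=0.1, beta=0.9, k=2):
    """One compressed momentum-SGD step with error-feedback.

    momentum: running momentum buffer; it low-pass-filters the gradients,
              which is what makes consecutive transmitted updates correlated.
    residual: compression error carried over from the previous step.
    Returns (transmitted update, new momentum, new residual).
    """
    momentum = beta * momentum + grad   # low-pass filter over gradients
    target = lr * momentum + residual   # fold in the error not yet sent
    sent = top_k(target, k)             # lossy compression of the update
    residual = target - sent            # remember what was discarded
    return sent, momentum, residual
```

Because the residual is added back before the next compression, the discarded mass is eventually transmitted, which is why convergence analyses of error-feedback schemes track this error term; the abstract's contribution is to prove such a bound when the compressor's error guarantee holds only in expectation rather than point-wise.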
doi_str_mv 10.1109/JSAIT.2021.3103494
format Article
fulltext fulltext_linktorsrc
identifier ISSN: 2641-8770
ispartof IEEE journal on selected areas in information theory, 2021-09, Vol.2 (3), p.970-986
issn 2641-8770
2641-8770
language eng
recordid cdi_proquest_journals_2575128706
source IEEE Xplore
subjects Compressing
Compressors
Computation
Computational modeling
Convergence
Correlation
Datasets
Distributed optimization
Feedback
gradient compression
Momentum
momentum SGD
Optimization
Prediction algorithms
Predictive coding
Quantization (signal)
rate distortion
title Compressing Gradients by Exploiting Temporal Correlation in Momentum-SGD
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-10T19%3A42%3A25IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_RIE&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Compressing%20Gradients%20by%20Exploiting%20Temporal%20Correlation%20in%20Momentum-SGD&rft.jtitle=IEEE%20journal%20on%20selected%20areas%20in%20information%20theory&rft.au=Adikari,%20Tharindu%20B.&rft.date=2021-09-01&rft.volume=2&rft.issue=3&rft.spage=970&rft.epage=986&rft.pages=970-986&rft.issn=2641-8770&rft.eissn=2641-8770&rft.coden=IJSTL5&rft_id=info:doi/10.1109/JSAIT.2021.3103494&rft_dat=%3Cproquest_RIE%3E2575128706%3C/proquest_RIE%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2575128706&rft_id=info:pmid/&rft_ieee_id=9511618&rfr_iscdi=true