Stochastic gradient descent without full data shuffle: with applications to in-database machine learning and deep learning systems

Modern machine learning (ML) systems commonly use stochastic gradient descent (SGD) to train ML models. However, SGD relies on random data order to converge, which usually requires a full data shuffle. For in-DB ML systems and deep learning systems with large datasets stored on block-addressable secondary storage such as HDD and SSD, this full data shuffle leads to low I/O performance: the data shuffling time can be even longer than the training itself, due to massive random data accesses. To balance the convergence rate of SGD (which favors data randomness) and its I/O performance (which favors sequential access), previous work has proposed several data shuffling strategies. In this paper, we first perform an empirical study on existing data shuffling strategies, showing that these strategies suffer from either low performance or low convergence rate. To solve this problem, we propose a simple but novel two-level data shuffling strategy named CorgiPile, which can avoid a full data shuffle while maintaining a convergence rate comparable to that of SGD with a full shuffle. We further theoretically analyze the convergence behavior of CorgiPile and empirically evaluate its efficacy in both in-DB ML and deep learning systems. For in-DB ML systems, we integrate CorgiPile into PostgreSQL by introducing three new physical operators with optimizations. For deep learning systems, we extend single-process CorgiPile to multi-process CorgiPile for the parallel/distributed environment and integrate it into PyTorch. Our evaluation shows that CorgiPile achieves a convergence rate comparable to full-shuffle-based SGD for both linear models and deep learning models. For in-DB ML with linear models, CorgiPile is 1.6×-12.8× faster than two state-of-the-art systems, Apache MADlib and Bismarck, on both HDD and SSD. For deep learning models on ImageNet, CorgiPile is 1.5× faster than PyTorch with full data shuffle.
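The abstract describes CorgiPile only at a high level: shuffle at two levels, first over blocks and then over the tuples gathered in a bounded buffer, so that storage reads stay mostly sequential while the data order seen by SGD remains close to random. The following minimal Python sketch illustrates that block-then-buffer idea under stated assumptions; it is not the authors' implementation, and the function name two_level_shuffle_iter, the block_size and buffer_blocks parameters, and the in-memory list standing in for block-addressable storage are all placeholders introduced here.

import random
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def two_level_shuffle_iter(
    dataset: Sequence[T],
    block_size: int,
    buffer_blocks: int,
    seed: int = 0,
) -> Iterator[T]:
    """Yield tuples in pseudo-random order without a full data shuffle.

    Level 1: permute the order in which fixed-size blocks are read
             (each block itself is read sequentially).
    Level 2: collect a few blocks in a small buffer and shuffle the
             tuples inside the buffer before yielding them.
    """
    rng = random.Random(seed)

    num_blocks = (len(dataset) + block_size - 1) // block_size
    block_ids = list(range(num_blocks))
    rng.shuffle(block_ids)          # level 1: block-level shuffle

    buffer: List[T] = []
    for i, b in enumerate(block_ids):
        # Read one block "sequentially" from the dataset.
        buffer.extend(dataset[b * block_size:(b + 1) * block_size])
        if (i + 1) % buffer_blocks == 0 or i == num_blocks - 1:
            rng.shuffle(buffer)     # level 2: tuple-level shuffle in the buffer
            yield from buffer
            buffer.clear()


if __name__ == "__main__":
    data = list(range(20))
    print(list(two_level_shuffle_iter(data, block_size=4, buffer_blocks=2)))

In this sketch only block indices are permuted up front, and tuple-level shuffling is confined to the buffer, so memory use and random I/O stay bounded regardless of dataset size; this is the trade-off between randomness and sequential access that the abstract attributes to CorgiPile.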

Bibliographic details

Published in: The VLDB Journal, 2024, Vol. 33 (5), p. 1231-1255
Main authors: Xu, Lijie; Qiu, Shuang; Yuan, Binhang; Jiang, Jiawei; Renggli, Cedric; Gan, Shaoduo; Kara, Kaan; Li, Guoliang; Liu, Ji; Wu, Wentao; Ye, Jieping; Zhang, Ce
Format: Article
Language: English
Subjects: Computer Science; Convergence; Database Management; Deep learning; Empirical analysis; Machine learning; Randomness; Regular Paper
DOI: 10.1007/s00778-024-00845-0
ISSN: 1066-8888
EISSN: 0949-877X
Publisher: Springer Berlin Heidelberg
Online access: Full text