Software-Hardware Co-design for Fast and Scalable Training of Deep Learning Recommendation Models

Deep learning recommendation models (DLRMs) are used across many business-critical services at Facebook and are the single largest AI application in terms of infrastructure demand in its data-centers. In this paper we discuss the SW/HW co-designed solution for high-performance distributed training of large-scale DLRMs.

Bibliographic Details
Main authors: Mudigere, Dheevatsa; Hao, Yuchen; Huang, Jianyu; Jia, Zhihao; Tulloch, Andrew; Sridharan, Srinivas; Liu, Xing; Ozdal, Mustafa; Nie, Jade; Park, Jongsoo; Luo, Liang; Yang, Jie Amy; Gao, Leon; Ivchenko, Dmytro; Basant, Aarti; Hu, Yuxi; Yang, Jiyan; Ardestani, Ehsan K; Wang, Xiaodong; Komuravelli, Rakesh; Chu, Ching-Hsiang; Yilmaz, Serhat; Li, Huayu; Qian, Jiyuan; Feng, Zhuobo; Ma, Yinbin; Yang, Junjie; Wen, Ellie; Li, Hong; Yang, Lin; Sun, Chonglin; Zhao, Whitney; Melts, Dimitry; Dhulipala, Krishna; Kishore, KR; Graf, Tyler; Eisenman, Assaf; Matam, Kiran Kumar; Gangidi, Adi; Chen, Guoqiang Jerry; Krishnan, Manoj; Nayak, Avinash; Nair, Krishnakumar; Muthiah, Bharath; khorashadi, Mahmoud; Bhattacharya, Pallab; Lapukhov, Petr; Naumov, Maxim; Mathews, Ajit; Qiao, Lin; Smelyanskiy, Mikhail; Jia, Bill; Rao, Vijay
Format: Article
Language: English
Online access: Order full text
description Deep learning recommendation models (DLRMs) are used across many business-critical services at Facebook and are the single largest AI application in terms of infrastructure demand in its data-centers. In this paper we discuss the SW/HW co-designed solution for high-performance distributed training of large-scale DLRMs. We introduce a high-performance scalable software stack based on PyTorch and pair it with the new evolution of the Zion platform, namely ZionEX. We demonstrate the capability to train very large DLRMs with up to 12 trillion parameters and show that we can attain a 40X speedup in terms of time to solution over previous systems. We achieve this by (i) designing the ZionEX platform with a dedicated scale-out network, provisioned with high bandwidth, optimal topology and efficient transport; (ii) implementing an optimized PyTorch-based training stack supporting both model and data parallelism; (iii) developing sharding algorithms capable of hierarchical partitioning of the embedding tables along row and column dimensions and load balancing them across multiple workers; (iv) adding high-performance core operators while retaining flexibility to support optimizers with fully deterministic updates; (v) leveraging reduced-precision communications, a multi-level memory hierarchy (HBM+DDR+SSD) and pipelining. Furthermore, we develop and briefly comment on distributed data ingestion and other supporting services that are required for robust and efficient end-to-end training in production environments.
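To make the sharding approach in point (iii) of the description above more concrete, the following is a minimal Python sketch of hierarchical row/column partitioning of embedding tables with greedy load balancing across workers. It is purely illustrative: the table names, the shard cost model (memory footprint only) and the longest-processing-time heuristic are assumptions made for this example, not the algorithm used in the paper.

```python
# Illustrative sketch only -- not the paper's sharding implementation.
# It mirrors the idea from the abstract: split large embedding tables
# along row and column dimensions, then balance the resulting shards
# across workers. The cost model and heuristic are assumptions.
import heapq
from dataclasses import dataclass


@dataclass
class Shard:
    table: str
    rows: int
    cols: int

    def cost(self) -> float:
        # Toy cost: memory footprint of the shard (rows x embedding dim).
        return float(self.rows * self.cols)


def split_table(name: str, rows: int, cols: int,
                max_rows: int, max_cols: int) -> list:
    """Hierarchically split one table row-wise, then column-wise."""
    shards = []
    for r0 in range(0, rows, max_rows):
        for c0 in range(0, cols, max_cols):
            shards.append(Shard(name,
                                min(max_rows, rows - r0),
                                min(max_cols, cols - c0)))
    return shards


def assign_to_workers(shards, num_workers: int):
    """Greedy longest-processing-time placement: each shard goes to the
    currently least-loaded worker."""
    heap = [(0.0, w) for w in range(num_workers)]  # (load, worker id)
    heapq.heapify(heap)
    placement = [[] for _ in range(num_workers)]
    for shard in sorted(shards, key=lambda s: s.cost(), reverse=True):
        load, w = heapq.heappop(heap)
        placement[w].append(shard)
        heapq.heappush(heap, (load + shard.cost(), w))
    return placement


if __name__ == "__main__":
    # Hypothetical tables: (number of rows, embedding dimension).
    tables = {"table_a": (10_000_000, 128), "table_b": (2_000_000, 64)}
    shards = [s for name, (r, c) in tables.items()
              for s in split_table(name, r, c, max_rows=1_000_000, max_cols=64)]
    for w, placed in enumerate(assign_to_workers(shards, num_workers=4)):
        total = sum(s.cost() for s in placed)
        print(f"worker {w}: {len(placed)} shards, cost {total:,.0f}")
```

A production sharder along the lines described in the abstract would additionally weight shard cost by per-table lookup (pooling) frequency and communication volume, and would place shards with awareness of the HBM+DDR+SSD memory hierarchy mentioned in point (v).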
doi 10.48550/arxiv.2104.05158
format Article
identifier DOI: 10.48550/arxiv.2104.05158
language eng
recordid cdi_arxiv_primary_2104_05158
source arXiv.org
subjects Computer Science - Artificial Intelligence
Computer Science - Distributed, Parallel, and Cluster Computing
Computer Science - Learning
Computer Science - Performance
title Software-Hardware Co-design for Fast and Scalable Training of Deep Learning Recommendation Models
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-29T07%3A02%3A55IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-arxiv_GOX&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Software-Hardware%20Co-design%20for%20Fast%20and%20Scalable%20Training%20of%20Deep%20Learning%20Recommendation%20Models&rft.au=Mudigere,%20Dheevatsa&rft.date=2021-04-11&rft_id=info:doi/10.48550/arxiv.2104.05158&rft_dat=%3Carxiv_GOX%3E2104_05158%3C/arxiv_GOX%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_id=info:pmid/&rfr_iscdi=true