Software-Hardware Co-design for Fast and Scalable Training of Deep Learning Recommendation Models

Deep learning recommendation models (DLRMs) are used across many business-critical services at Facebook and are the single largest AI application in terms of infrastructure demand in its data-centers. In this paper we discuss the SW/HW co-designed solution for high-performance distributed training of large-scale DLRMs.

Bibliographic Details
Main authors: Mudigere, Dheevatsa; Hao, Yuchen; Huang, Jianyu; Jia, Zhihao; Tulloch, Andrew; Sridharan, Srinivas; Liu, Xing; Ozdal, Mustafa; Nie, Jade; Park, Jongsoo; Luo, Liang; Yang, Jie Amy; Gao, Leon; Ivchenko, Dmytro; Basant, Aarti; Hu, Yuxi; Yang, Jiyan; Ardestani, Ehsan K; Wang, Xiaodong; Komuravelli, Rakesh; Chu, Ching-Hsiang; Yilmaz, Serhat; Li, Huayu; Qian, Jiyuan; Feng, Zhuobo; Ma, Yinbin; Yang, Junjie; Wen, Ellie; Li, Hong; Yang, Lin; Sun, Chonglin; Zhao, Whitney; Melts, Dimitry; Dhulipala, Krishna; Kishore, KR; Graf, Tyler; Eisenman, Assaf; Matam, Kiran Kumar; Gangidi, Adi; Chen, Guoqiang Jerry; Krishnan, Manoj; Nayak, Avinash; Nair, Krishnakumar; Muthiah, Bharath; khorashadi, Mahmoud; Bhattacharya, Pallab; Lapukhov, Petr; Naumov, Maxim; Mathews, Ajit; Qiao, Lin; Smelyanskiy, Mikhail; Jia, Bill; Rao, Vijay
Format: Article
Language: English
Online access: Order full text
description Deep learning recommendation models (DLRMs) are used across many business-critical services at Facebook and are the single largest AI application in terms of infrastructure demand in its data-centers. In this paper we discuss the SW/HW co-designed solution for high-performance distributed training of large-scale DLRMs. We introduce a high-performance scalable software stack based on PyTorch and pair it with the new evolution of the Zion platform, namely ZionEX. We demonstrate the capability to train very large DLRMs with up to 12 trillion parameters and show that we can attain a 40X speedup in terms of time to solution over previous systems. We achieve this by (i) designing the ZionEX platform with a dedicated scale-out network, provisioned with high bandwidth, optimal topology and efficient transport; (ii) implementing an optimized PyTorch-based training stack supporting both model and data parallelism; (iii) developing sharding algorithms capable of hierarchical partitioning of the embedding tables along row and column dimensions and load balancing them across multiple workers; (iv) adding high-performance core operators while retaining flexibility to support optimizers with fully deterministic updates; (v) leveraging reduced-precision communications, a multi-level memory hierarchy (HBM+DDR+SSD) and pipelining. Furthermore, we develop and briefly comment on distributed data ingestion and other supporting services that are required for robust and efficient end-to-end training in production environments.
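To make the sharding approach in point (iii) of the description above more concrete, the following is a minimal Python sketch of hierarchical row/column partitioning of embedding tables with greedy load balancing across workers. It is purely illustrative: the table names, the shard cost model (memory footprint only) and the longest-processing-time heuristic are assumptions made for this example, not the algorithm used in the paper.

```python
# Illustrative sketch only -- not the paper's sharding implementation.
# It mirrors the idea from the abstract: split large embedding tables
# along row and column dimensions, then balance the resulting shards
# across workers. The cost model and heuristic are assumptions.
import heapq
from dataclasses import dataclass


@dataclass
class Shard:
    table: str
    rows: int
    cols: int

    def cost(self) -> float:
        # Toy cost: memory footprint of the shard (rows x embedding dim).
        return float(self.rows * self.cols)


def split_table(name: str, rows: int, cols: int,
                max_rows: int, max_cols: int) -> list:
    """Hierarchically split one table row-wise, then column-wise."""
    shards = []
    for r0 in range(0, rows, max_rows):
        for c0 in range(0, cols, max_cols):
            shards.append(Shard(name,
                                min(max_rows, rows - r0),
                                min(max_cols, cols - c0)))
    return shards


def assign_to_workers(shards, num_workers: int):
    """Greedy longest-processing-time placement: each shard goes to the
    currently least-loaded worker."""
    heap = [(0.0, w) for w in range(num_workers)]  # (load, worker id)
    heapq.heapify(heap)
    placement = [[] for _ in range(num_workers)]
    for shard in sorted(shards, key=lambda s: s.cost(), reverse=True):
        load, w = heapq.heappop(heap)
        placement[w].append(shard)
        heapq.heappush(heap, (load + shard.cost(), w))
    return placement


if __name__ == "__main__":
    # Hypothetical tables: (number of rows, embedding dimension).
    tables = {"table_a": (10_000_000, 128), "table_b": (2_000_000, 64)}
    shards = [s for name, (r, c) in tables.items()
              for s in split_table(name, r, c, max_rows=1_000_000, max_cols=64)]
    for w, placed in enumerate(assign_to_workers(shards, num_workers=4)):
        total = sum(s.cost() for s in placed)
        print(f"worker {w}: {len(placed)} shards, cost {total:,.0f}")
```

A production sharder along the lines described in the abstract would additionally weight shard cost by per-table lookup (pooling) frequency and communication volume, and would place shards with awareness of the HBM+DDR+SSD memory hierarchy mentioned in point (v).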
doi 10.48550/arxiv.2104.05158
format Article
identifier DOI: 10.48550/arxiv.2104.05158
language eng
recordid cdi_arxiv_primary_2104_05158
source arXiv.org
subjects Computer Science - Artificial Intelligence
Computer Science - Distributed, Parallel, and Cluster Computing
Computer Science - Learning
Computer Science - Performance
title Software-Hardware Co-design for Fast and Scalable Training of Deep Learning Recommendation Models
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-29T07%3A02%3A55IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-arxiv_GOX&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Software-Hardware%20Co-design%20for%20Fast%20and%20Scalable%20Training%20of%20Deep%20Learning%20Recommendation%20Models&rft.au=Mudigere,%20Dheevatsa&rft.date=2021-04-11&rft_id=info:doi/10.48550/arxiv.2104.05158&rft_dat=%3Carxiv_GOX%3E2104_05158%3C/arxiv_GOX%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_id=info:pmid/&rfr_iscdi=true