Benchmarking network fabrics for data distributed training of deep neural networks

Artificial Intelligence/Machine Learning applications require the training of complex models on large amounts of labelled data. The large computational requirements for training deep models have necessitated the development of new methods for faster training. One such approach is the data parallel approach, where the training data is distributed across multiple compute nodes. This approach is simple to implement and supported by most of the commonly used machine learning frameworks. The data parallel approach leverages MPI for communicating gradients across all nodes. In this paper, we examine the effects of using different physical hardware interconnects and network-related software primitives for enabling data distributed deep learning. We compare the effect of using GPUDirect and NCCL on Ethernet and OmniPath fabrics. Our results show that using Ethernet-based networking in shared HPC systems does not have a significant effect on the training times for commonly used deep neural network architectures or traditional HPC applications such as Computational Fluid Dynamics.
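
The data-parallel scheme summarized above can be made concrete with a short sketch. The following is a minimal example, not the benchmark code from the paper: it uses PyTorch's DistributedDataParallel with the NCCL backend, one common way to realize the gradient all-reduce that the paper measures over Ethernet and OmniPath fabrics. The model, batch size, and iteration count are placeholders.

```python
# Minimal data-parallel training sketch (assumption: PyTorch + NCCL; not the paper's benchmark code).
# Launch one process per GPU, e.g. `torchrun --nproc_per_node=<gpus> train.py`,
# or under an MPI launcher that sets RANK/WORLD_SIZE/LOCAL_RANK.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # NCCL handles the inter-GPU gradient all-reduce; on a capable fabric it can
    # use GPUDirect RDMA so gradient buffers bypass host memory.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    torch.cuda.set_device(local_rank)
    device = f"cuda:{local_rank}"

    model = torch.nn.Linear(1024, 10).to(device)      # placeholder model
    model = DDP(model, device_ids=[local_rank])       # averages gradients across all ranks
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    for _ in range(10):                               # stand-in for iterating over a sharded dataset
        x = torch.randn(32, 1024, device=device)
        y = torch.randint(0, 10, (32,), device=device)
        loss = torch.nn.functional.cross_entropy(model(x), y)
        optimizer.zero_grad()
        loss.backward()                               # gradient all-reduce happens during backward
        optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

In a real run each rank would read a distinct shard of the training set (e.g. via a DistributedSampler), so the interconnect is exercised mainly by the per-step gradient all-reduce, which is the kind of traffic the paper's fabric comparison targets.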

Detailed description

Saved in:
Bibliographic details
Published in: arXiv.org 2020-08
Main authors: Samsi, Siddharth; Prout, Andrew; Jones, Michael; Kirby, Andrew; Arcand, Bill; Bergeron, Bill; Bestor, David; Byun, Chansup; Gadepally, Vijay; Houle, Michael; Hubbell, Matthew; Klein, Anna; Michaleas, Peter; Milechin, Lauren; Mullen, Julie; Rosa, Antonio; Yee, Charles; Reuther, Albert; Kepner, Jeremy
Format: Article
Language: eng
Subjects:
Online access: Full text
container_title arXiv.org
creator Samsi, Siddharth
Prout, Andrew
Jones, Michael
Kirby, Andrew
Arcand, Bill
Bergeron, Bill
Bestor, David
Byun, Chansup
Gadepally, Vijay
Houle, Michael
Hubbell, Matthew
Klein, Anna
Michaleas, Peter
Milechin, Lauren
Mullen, Julie
Rosa, Antonio
Yee, Charles
Reuther, Albert
Kepner, Jeremy
description Artificial Intelligence/Machine Learning applications require the training of complex models on large amounts of labelled data. The large computational requirements for training deep models have necessitated the development of new methods for faster training. One such approach is the data parallel approach, where the training data is distributed across multiple compute nodes. This approach is simple to implement and supported by most of the commonly used machine learning frameworks. The data parallel approach leverages MPI for communicating gradients across all nodes. In this paper, we examine the effects of using different physical hardware interconnects and network-related software primitives for enabling data distributed deep learning. We compare the effect of using GPUDirect and NCCL on Ethernet and OmniPath fabrics. Our results show that using Ethernet-based networking in shared HPC systems does not have a significant effect on the training times for commonly used deep neural network architectures or traditional HPC applications such as Computational Fluid Dynamics.
doi_str_mv 10.48550/arxiv.2008.08057
format Article
fulltext fulltext
identifier EISSN: 2331-8422
ispartof arXiv.org, 2020-08
issn 2331-8422
language eng
recordid cdi_arxiv_primary_2008_08057
source arXiv.org; Free E-Journals
subjects Artificial intelligence
Artificial neural networks
Communication
Computational fluid dynamics
Computer architecture
Computer Science - Distributed, Parallel, and Cluster Computing
Computer Science - Learning
Computer Science - Performance
Ethernet
Fabrics
Machine learning
Neural networks
Nodes
Training
title Benchmarking network fabrics for data distributed training of deep neural networks
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-15T17%3A31%3A58IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_arxiv&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Benchmarking%20network%20fabrics%20for%20data%20distributed%20training%20of%20deep%20neural%20networks&rft.jtitle=arXiv.org&rft.au=Samsi,%20Siddharth&rft.date=2020-08-18&rft.eissn=2331-8422&rft_id=info:doi/10.48550/arxiv.2008.08057&rft_dat=%3Cproquest_arxiv%3E2435344071%3C/proquest_arxiv%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2435344071&rft_id=info:pmid/&rfr_iscdi=true