Challenging the Need for Packet Spraying in Large-Scale Distributed Training

Large-scale distributed training in production datacenters constitutes a challenging workload bottlenecked by network communication. In response, both major industry players (e.g., Ultra Ethernet Consortium) and parts of academia have surprisingly, and almost unanimously, agreed that packet spraying...

Ausführliche Beschreibung

Gespeichert in:
Bibliographische Detailangaben
Veröffentlicht in:arXiv.org 2024-06
Hauptverfasser: Addanki, Vamsi, Goyal, Prateesh, Marinos, Ilias
Format: Artikel
Sprache:eng
Schlagworte:
Online-Zugang:Volltext
Tags: Tag hinzufügen
Keine Tags, Fügen Sie den ersten Tag hinzu!
container_end_page
container_issue
container_start_page
container_title arXiv.org
container_volume
creator Addanki, Vamsi
Goyal, Prateesh
Marinos, Ilias
description Large-scale distributed training in production datacenters constitutes a challenging workload bottlenecked by network communication. In response, both major industry players (e.g., Ultra Ethernet Consortium) and parts of academia have surprisingly, and almost unanimously, agreed that packet spraying is necessary to improve the performance of large-scale distributed training workloads. In this paper, we challenge this prevailing belief and pose the question: How close can a singlepath transport approach an optimal multipath transport? We demonstrate that singlepath transport (from a NIC's perspective) is sufficient and can perform nearly as well as an ideal multipath transport with packet spraying, particularly in the context of distributed training in leaf-spine topologies. Our assertion is based on four key observations about workloads driven by collective communication patterns: (i) flows within a collective start almost simultaneously, (ii) flow sizes are nearly equal, (iii) the completion time of a collective is more crucial than individual flow completion times, and (iv) flows can be split upon arrival. We analytically prove that singlepath transport, using minimal flow splitting (at the application layer), is equivalent to an ideal multipath transport with packet spraying in terms of maximum congestion. Our preliminary evaluations support our claims. This paper suggests an alternative agenda for developing next-generation transport protocols tailored for large-scale distributed training.
format Article
fullrecord <record><control><sourceid>proquest</sourceid><recordid>TN_cdi_proquest_journals_3074866879</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><sourcerecordid>3074866879</sourcerecordid><originalsourceid>FETCH-proquest_journals_30748668793</originalsourceid><addsrcrecordid>eNqNyk8LgjAYgPERBEn5HQadhbX5r7MVHSQCvcu0V52Nae_moW-fQR-g03N4fivicSEOQRpyviG-tQNjjMcJjyLhkTzrpdZgOmU66nqgN4AHbUekd9k8wdFiQvn-TmVoLrGDoGikBnpS1qGqZ7fwEqUyi9mRdSu1Bf_XLdlfzmV2DSYcXzNYVw3jjGZZlWBJmMZxmhzFf-oD91I9AA</addsrcrecordid><sourcetype>Aggregation Database</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>3074866879</pqid></control><display><type>article</type><title>Challenging the Need for Packet Spraying in Large-Scale Distributed Training</title><source>Freely Accessible Journals</source><creator>Addanki, Vamsi ; Goyal, Prateesh ; Marinos, Ilias</creator><creatorcontrib>Addanki, Vamsi ; Goyal, Prateesh ; Marinos, Ilias</creatorcontrib><description>Large-scale distributed training in production datacenters constitutes a challenging workload bottlenecked by network communication. In response, both major industry players (e.g., Ultra Ethernet Consortium) and parts of academia have surprisingly, and almost unanimously, agreed that packet spraying is necessary to improve the performance of large-scale distributed training workloads. In this paper, we challenge this prevailing belief and pose the question: How close can a singlepath transport approach an optimal multipath transport? We demonstrate that singlepath transport (from a NIC's perspective) is sufficient and can perform nearly as well as an ideal multipath transport with packet spraying, particularly in the context of distributed training in leaf-spine topologies. Our assertion is based on four key observations about workloads driven by collective communication patterns: (i) flows within a collective start almost simultaneously, (ii) flow sizes are nearly equal, (iii) the completion time of a collective is more crucial than individual flow completion times, and (iv) flows can be split upon arrival. We analytically prove that singlepath transport, using minimal flow splitting (at the application layer), is equivalent to an ideal multipath transport with packet spraying in terms of maximum congestion. Our preliminary evaluations support our claims. This paper suggests an alternative agenda for developing next-generation transport protocols tailored for large-scale distributed training.</description><identifier>EISSN: 2331-8422</identifier><language>eng</language><publisher>Ithaca: Cornell University Library, arXiv.org</publisher><subject>Completion time ; Ethernet ; Spraying ; Topology ; Workload ; Workloads</subject><ispartof>arXiv.org, 2024-06</ispartof><rights>2024. This work is published under http://creativecommons.org/licenses/by-nc-nd/4.0/ (the “License”). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.</rights><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><link.rule.ids>777,781</link.rule.ids></links><search><creatorcontrib>Addanki, Vamsi</creatorcontrib><creatorcontrib>Goyal, Prateesh</creatorcontrib><creatorcontrib>Marinos, Ilias</creatorcontrib><title>Challenging the Need for Packet Spraying in Large-Scale Distributed Training</title><title>arXiv.org</title><description>Large-scale distributed training in production datacenters constitutes a challenging workload bottlenecked by network communication. In response, both major industry players (e.g., Ultra Ethernet Consortium) and parts of academia have surprisingly, and almost unanimously, agreed that packet spraying is necessary to improve the performance of large-scale distributed training workloads. In this paper, we challenge this prevailing belief and pose the question: How close can a singlepath transport approach an optimal multipath transport? We demonstrate that singlepath transport (from a NIC's perspective) is sufficient and can perform nearly as well as an ideal multipath transport with packet spraying, particularly in the context of distributed training in leaf-spine topologies. Our assertion is based on four key observations about workloads driven by collective communication patterns: (i) flows within a collective start almost simultaneously, (ii) flow sizes are nearly equal, (iii) the completion time of a collective is more crucial than individual flow completion times, and (iv) flows can be split upon arrival. We analytically prove that singlepath transport, using minimal flow splitting (at the application layer), is equivalent to an ideal multipath transport with packet spraying in terms of maximum congestion. Our preliminary evaluations support our claims. This paper suggests an alternative agenda for developing next-generation transport protocols tailored for large-scale distributed training.</description><subject>Completion time</subject><subject>Ethernet</subject><subject>Spraying</subject><subject>Topology</subject><subject>Workload</subject><subject>Workloads</subject><issn>2331-8422</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2024</creationdate><recordtype>article</recordtype><sourceid>ABUWG</sourceid><sourceid>AFKRA</sourceid><sourceid>AZQEC</sourceid><sourceid>BENPR</sourceid><sourceid>CCPQU</sourceid><sourceid>DWQXO</sourceid><recordid>eNqNyk8LgjAYgPERBEn5HQadhbX5r7MVHSQCvcu0V52Nae_moW-fQR-g03N4fivicSEOQRpyviG-tQNjjMcJjyLhkTzrpdZgOmU66nqgN4AHbUekd9k8wdFiQvn-TmVoLrGDoGikBnpS1qGqZ7fwEqUyi9mRdSu1Bf_XLdlfzmV2DSYcXzNYVw3jjGZZlWBJmMZxmhzFf-oD91I9AA</recordid><startdate>20240630</startdate><enddate>20240630</enddate><creator>Addanki, Vamsi</creator><creator>Goyal, Prateesh</creator><creator>Marinos, Ilias</creator><general>Cornell University Library, arXiv.org</general><scope>8FE</scope><scope>8FG</scope><scope>ABJCF</scope><scope>ABUWG</scope><scope>AFKRA</scope><scope>AZQEC</scope><scope>BENPR</scope><scope>BGLVJ</scope><scope>CCPQU</scope><scope>DWQXO</scope><scope>HCIFZ</scope><scope>L6V</scope><scope>M7S</scope><scope>PIMPY</scope><scope>PQEST</scope><scope>PQQKQ</scope><scope>PQUKI</scope><scope>PRINS</scope><scope>PTHSS</scope></search><sort><creationdate>20240630</creationdate><title>Challenging the Need for Packet Spraying in Large-Scale Distributed Training</title><author>Addanki, Vamsi ; Goyal, Prateesh ; Marinos, Ilias</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-proquest_journals_30748668793</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2024</creationdate><topic>Completion time</topic><topic>Ethernet</topic><topic>Spraying</topic><topic>Topology</topic><topic>Workload</topic><topic>Workloads</topic><toplevel>online_resources</toplevel><creatorcontrib>Addanki, Vamsi</creatorcontrib><creatorcontrib>Goyal, Prateesh</creatorcontrib><creatorcontrib>Marinos, Ilias</creatorcontrib><collection>ProQuest SciTech Collection</collection><collection>ProQuest Technology Collection</collection><collection>Materials Science &amp; Engineering Collection</collection><collection>ProQuest Central (Alumni Edition)</collection><collection>ProQuest Central UK/Ireland</collection><collection>ProQuest Central Essentials</collection><collection>ProQuest Central</collection><collection>Technology Collection</collection><collection>ProQuest One Community College</collection><collection>ProQuest Central Korea</collection><collection>SciTech Premium Collection</collection><collection>ProQuest Engineering Collection</collection><collection>Engineering Database</collection><collection>Publicly Available Content Database</collection><collection>ProQuest One Academic Eastern Edition (DO NOT USE)</collection><collection>ProQuest One Academic</collection><collection>ProQuest One Academic UKI Edition</collection><collection>ProQuest Central China</collection><collection>Engineering Collection</collection></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Addanki, Vamsi</au><au>Goyal, Prateesh</au><au>Marinos, Ilias</au><format>book</format><genre>document</genre><ristype>GEN</ristype><atitle>Challenging the Need for Packet Spraying in Large-Scale Distributed Training</atitle><jtitle>arXiv.org</jtitle><date>2024-06-30</date><risdate>2024</risdate><eissn>2331-8422</eissn><abstract>Large-scale distributed training in production datacenters constitutes a challenging workload bottlenecked by network communication. In response, both major industry players (e.g., Ultra Ethernet Consortium) and parts of academia have surprisingly, and almost unanimously, agreed that packet spraying is necessary to improve the performance of large-scale distributed training workloads. In this paper, we challenge this prevailing belief and pose the question: How close can a singlepath transport approach an optimal multipath transport? We demonstrate that singlepath transport (from a NIC's perspective) is sufficient and can perform nearly as well as an ideal multipath transport with packet spraying, particularly in the context of distributed training in leaf-spine topologies. Our assertion is based on four key observations about workloads driven by collective communication patterns: (i) flows within a collective start almost simultaneously, (ii) flow sizes are nearly equal, (iii) the completion time of a collective is more crucial than individual flow completion times, and (iv) flows can be split upon arrival. We analytically prove that singlepath transport, using minimal flow splitting (at the application layer), is equivalent to an ideal multipath transport with packet spraying in terms of maximum congestion. Our preliminary evaluations support our claims. This paper suggests an alternative agenda for developing next-generation transport protocols tailored for large-scale distributed training.</abstract><cop>Ithaca</cop><pub>Cornell University Library, arXiv.org</pub><oa>free_for_read</oa></addata></record>
fulltext fulltext
identifier EISSN: 2331-8422
ispartof arXiv.org, 2024-06
issn 2331-8422
language eng
recordid cdi_proquest_journals_3074866879
source Freely Accessible Journals
subjects Completion time
Ethernet
Spraying
Topology
Workload
Workloads
title Challenging the Need for Packet Spraying in Large-Scale Distributed Training
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-18T05%3A49%3A33IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest&rft_val_fmt=info:ofi/fmt:kev:mtx:book&rft.genre=document&rft.atitle=Challenging%20the%20Need%20for%20Packet%20Spraying%20in%20Large-Scale%20Distributed%20Training&rft.jtitle=arXiv.org&rft.au=Addanki,%20Vamsi&rft.date=2024-06-30&rft.eissn=2331-8422&rft_id=info:doi/&rft_dat=%3Cproquest%3E3074866879%3C/proquest%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=3074866879&rft_id=info:pmid/&rfr_iscdi=true