Efficient Training of Large Language Models on Distributed Infrastructures: A Survey

Large Language Models (LLMs) like GPT and LLaMA are revolutionizing the AI industry with their sophisticated capabilities. Training these models requires vast GPU clusters and significant computing time, posing major challenges in terms of scalability, efficiency, and reliability. This survey explores recent advancements in training systems for LLMs, including innovations in training infrastructure with AI accelerators, networking, storage, and scheduling. Additionally, the survey covers parallelism strategies, as well as optimizations for computation, communication, and memory in distributed LLM training. It also includes approaches to maintaining system reliability over extended training periods. By examining current innovations and future directions, this survey aims to provide valuable insights towards improving LLM training systems and tackling ongoing challenges. Furthermore, traditional digital circuit-based computing systems face significant constraints in meeting the computational demands of LLMs, highlighting the need for innovative solutions such as optical computing and optical networks.

Bibliographic Details
Main Authors: Duan, Jiangfei; Zhang, Shuo; Wang, Zerui; Jiang, Lijuan; Qu, Wenwen; Hu, Qinghao; Wang, Guoteng; Weng, Qizhen; Yan, Hang; Zhang, Xingcheng; Qiu, Xipeng; Lin, Dahua; Wen, Yonggang; Jin, Xin; Zhang, Tianwei; Sun, Peng
Format: Article
Language: English
Subjects: Computer Science - Distributed, Parallel, and Cluster Computing
Online Access: Order full text
creator Duan, Jiangfei
Zhang, Shuo
Wang, Zerui
Jiang, Lijuan
Qu, Wenwen
Hu, Qinghao
Wang, Guoteng
Weng, Qizhen
Yan, Hang
Zhang, Xingcheng
Qiu, Xipeng
Lin, Dahua
Wen, Yonggang
Jin, Xin
Zhang, Tianwei
Sun, Peng
description Large Language Models (LLMs) like GPT and LLaMA are revolutionizing the AI industry with their sophisticated capabilities. Training these models requires vast GPU clusters and significant computing time, posing major challenges in terms of scalability, efficiency, and reliability. This survey explores recent advancements in training systems for LLMs, including innovations in training infrastructure with AI accelerators, networking, storage, and scheduling. Additionally, the survey covers parallelism strategies, as well as optimizations for computation, communication, and memory in distributed LLM training. It also includes approaches to maintaining system reliability over extended training periods. By examining current innovations and future directions, this survey aims to provide valuable insights towards improving LLM training systems and tackling ongoing challenges. Furthermore, traditional digital circuit-based computing systems face significant constraints in meeting the computational demands of LLMs, highlighting the need for innovative solutions such as optical computing and optical networks.
doi_str_mv 10.48550/arxiv.2407.20018
format Article
fulltext fulltext_linktorsrc
identifier DOI: 10.48550/arxiv.2407.20018
language eng
recordid cdi_arxiv_primary_2407_20018
source arXiv.org
subjects Computer Science - Distributed, Parallel, and Cluster Computing
title Efficient Training of Large Language Models on Distributed Infrastructures: A Survey
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-03T17%3A22%3A47IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-arxiv_GOX&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Efficient%20Training%20of%20Large%20Language%20Models%20on%20Distributed%20Infrastructures:%20A%20Survey&rft.au=Duan,%20Jiangfei&rft.date=2024-07-29&rft_id=info:doi/10.48550/arxiv.2407.20018&rft_dat=%3Carxiv_GOX%3E2407_20018%3C/arxiv_GOX%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_id=info:pmid/&rfr_iscdi=true