Efficient Training of Large Language Models on Distributed Infrastructures: A Survey
Main authors: | Duan, Jiangfei; Zhang, Shuo; Wang, Zerui; Jiang, Lijuan; Qu, Wenwen; Hu, Qinghao; Wang, Guoteng; Weng, Qizhen; Yan, Hang; Zhang, Xingcheng; Qiu, Xipeng; Lin, Dahua; Wen, Yonggang; Jin, Xin; Zhang, Tianwei; Sun, Peng |
---|---|
Format: | Article |
Language: | eng |
Subjects: | Computer Science - Distributed, Parallel, and Cluster Computing |
Online access: | Order full text |
creator | Duan, Jiangfei; Zhang, Shuo; Wang, Zerui; Jiang, Lijuan; Qu, Wenwen; Hu, Qinghao; Wang, Guoteng; Weng, Qizhen; Yan, Hang; Zhang, Xingcheng; Qiu, Xipeng; Lin, Dahua; Wen, Yonggang; Jin, Xin; Zhang, Tianwei; Sun, Peng |
description | Large Language Models (LLMs) like GPT and LLaMA are revolutionizing the AI
industry with their sophisticated capabilities. Training these models requires
vast GPU clusters and significant computing time, posing major challenges in
terms of scalability, efficiency, and reliability. This survey explores recent
advancements in training systems for LLMs, including innovations in training
infrastructure with AI accelerators, networking, storage, and scheduling.
Additionally, the survey covers parallelism strategies, as well as
optimizations for computation, communication, and memory in distributed LLM
training. It also includes approaches to maintaining system reliability over
extended training periods. By examining current innovations and future
directions, this survey aims to provide valuable insights towards improving LLM
training systems and tackling ongoing challenges. Furthermore, traditional
digital circuit-based computing systems face significant constraints in meeting
the computational demands of LLMs, highlighting the need for innovative
solutions such as optical computing and optical networks. |
doi_str_mv | 10.48550/arxiv.2407.20018 |
format | Article |
creationdate | 2024-07-29 |
rights | http://creativecommons.org/licenses/by/4.0 |
fulltext | fulltext_linktorsrc |
identifier | DOI: 10.48550/arxiv.2407.20018 |
language | eng |
recordid | cdi_arxiv_primary_2407_20018 |
source | arXiv.org |
subjects | Computer Science - Distributed, Parallel, and Cluster Computing |
title | Efficient Training of Large Language Models on Distributed Infrastructures: A Survey |
url | https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-03T17%3A22%3A47IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-arxiv_GOX&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Efficient%20Training%20of%20Large%20Language%20Models%20on%20Distributed%20Infrastructures:%20A%20Survey&rft.au=Duan,%20Jiangfei&rft.date=2024-07-29&rft_id=info:doi/10.48550/arxiv.2407.20018&rft_dat=%3Carxiv_GOX%3E2407_20018%3C/arxiv_GOX%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_id=info:pmid/&rfr_iscdi=true |
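
The abstract above names parallelism strategies and optimizations for computation, communication, and memory as core topics of the survey. As a point of reference only (not code from the paper), the sketch below shows the simplest such strategy, data parallelism with PyTorch DistributedDataParallel: every GPU holds a full model replica and gradients are averaged across replicas at each step. The model, batch shape, and hyperparameters are placeholders.

```python
# Illustrative sketch of data-parallel training (not from the survey).
# Launch with torchrun, which sets RANK, LOCAL_RANK, and WORLD_SIZE per process.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # Initialize the process group; NCCL is the usual backend for GPU clusters.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Placeholder model; a real LLM would be a Transformer with billions of parameters.
    model = torch.nn.Linear(1024, 1024).cuda(local_rank)

    # DDP replicates the model on every GPU and all-reduces gradients during backward,
    # overlapping communication with computation.
    model = DDP(model, device_ids=[local_rank])
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

    for step in range(10):
        x = torch.randn(32, 1024, device=f"cuda:{local_rank}")  # stand-in for a token batch
        loss = model(x).pow(2).mean()                            # stand-in for an LM loss
        loss.backward()                                          # gradients synchronized here
        optimizer.step()
        optimizer.zero_grad()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

Launched with, e.g., `torchrun --nproc_per_node=8 ddp_sketch.py`, this scales across the GPUs of a single node. The tensor, pipeline, and hybrid parallelism schemes discussed in the survey go further by sharding the model itself across devices rather than replicating it.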