Beyond the Cloud: Edge Inference for Generative Large Language Models in Wireless Networks

Generative Artificial Intelligence (GAI) is revolutionizing the world with its unprecedented content creation ability, and Large Language Models (LLMs) are among its most widely embraced branches. However, due to LLMs' substantial size and resource-intensive nature, they are typically cloud-hosted, raising concerns about privacy, usage limitations, and latency. In this paper, we propose to utilize ubiquitous distributed wireless edge computing resources for real-time LLM inference. Specifically, we introduce a novel LLM edge inference framework, incorporating batching and model quantization to ensure high-throughput inference on resource-limited edge devices. Then, based on the architecture of transformer decoder-based LLMs, we formulate an NP-hard edge inference optimization problem that considers batch scheduling and the joint allocation of communication and computation resources, whose solution is the optimal throughput under edge resource constraints and heterogeneous user requirements on latency and accuracy. To solve this NP-hard problem, we develop OT-GAH (Optimal Tree-search with Generalized Assignment Heuristics), an algorithm with reasonable complexity and a 1/2-approximation ratio. We first design the OT algorithm with online tree-pruning for the single-edge-node multiuser case, which navigates inference request selection within a tree structure to maximize throughput. We then consider the multi-edge-node case and propose the GAH algorithm, which recursively invokes OT in each node's inference scheduling iteration. Simulation results demonstrate the superiority of OT-GAH batching over other benchmarks, revealing an over 45% time complexity reduction compared to brute-force searching.
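The abstract describes OT-GAH only at a high level. The following Python sketch illustrates the two ingredients it names: a pruned tree search over candidate request batches on a single edge node, and a greedy outer loop that invokes that search once per node. Everything here is a hypothetical simplification for illustration; the `Request` fields, the additive latency model, and the scalar per-node `budget` are stand-ins, not the paper's actual formulation or its joint communication/computation constraints.

```python
from dataclasses import dataclass

@dataclass
class Request:
    tokens: int      # output tokens requested; stands in for throughput gain
    latency: float   # marginal latency this request adds to a batch (assumed additive)
    deadline: float  # user's end-to-end latency requirement

def best_batch(requests: list, budget: float):
    """Toy OT-style branch-and-bound: search the binary include/skip tree of
    requests for the token-maximizing batch that fits the node's latency
    budget and the tightest deadline among its chosen requests."""
    best_tokens, best_set = 0, []

    def dfs(i, chosen, tokens, latency, min_dl):
        nonlocal best_tokens, best_set
        if tokens > best_tokens:
            best_tokens, best_set = tokens, list(chosen)
        if i == len(requests):
            return
        # Online pruning: optimistic bound assuming all remaining requests fit.
        if tokens + sum(r.tokens for r in requests[i:]) <= best_tokens:
            return
        r = requests[i]
        t = latency + r.latency
        dl = min(min_dl, r.deadline)
        if t <= budget and t <= dl:              # branch 1: include request i
            chosen.append(r)
            dfs(i + 1, chosen, tokens + r.tokens, t, dl)
            chosen.pop()
        dfs(i + 1, chosen, tokens, latency, min_dl)  # branch 2: skip request i

    dfs(0, [], 0, 0.0, float("inf"))
    return best_tokens, best_set

def greedy_assign(requests: list, node_budgets: list):
    """Toy GAH-style outer loop: visit edge nodes in turn and re-run the tree
    search on the still-unassigned requests, mimicking the per-node
    scheduling iterations described in the abstract."""
    remaining, plan = list(requests), {}
    for node, budget in enumerate(node_budgets):
        _, batch = best_batch(remaining, budget)
        plan[node] = batch
        picked = {id(r) for r in batch}
        remaining = [r for r in remaining if id(r) not in picked]
    return plan, remaining   # leftovers would be deferred or rejected

if __name__ == "__main__":
    reqs = [Request(120, 0.4, 1.0), Request(80, 0.3, 0.6),
            Request(200, 0.7, 2.0), Request(60, 0.2, 0.5)]
    plan, leftover = greedy_assign(reqs, node_budgets=[1.0, 0.5])
    print(plan, leftover)
```

The optimistic bound makes the pruning safe in this toy: a subtree is skipped only when even admitting every remaining request could not beat the incumbent, which is the same flavor of online tree-pruning the abstract attributes to OT, though the paper's bound and approximation analysis are of course its own.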

Bibliographic details

Published in: IEEE Transactions on Wireless Communications, 2024-11, pp. 1-1
Authors: Zhang, Xinyuan; Nie, Jiangtian; Huang, Yudong; Xie, Gaochang; Xiong, Zehui; Liu, Jiang; Niyato, Dusit; Shen, Xuemin Sherman
Format: Article
Language: English
Subjects: edge inference; Generative AI; multiuser edge computing; wireless networks
DOI: 10.1109/TWC.2024.3497923
ISSN: 1536-1276
EISSN: 1558-2248
Source: IEEE Electronic Library (IEL)
Online access: Order full text