Beyond the Cloud: Edge Inference for Generative Large Language Models in Wireless Networks

Generative Artificial Intelligence (GAI) is revolutionizing the world with its unprecedented content creation ability, and Large Language Models (LLMs) are among its most widely embraced branches. However, due to LLMs' substantial size and resource-intensive nature, they are typically cloud-hosted, raising concerns about privacy, usage limitations, and latency. In this paper, we propose to utilize ubiquitous distributed wireless edge computing resources for real-time LLM inference. Specifically, we introduce a novel LLM edge inference framework, incorporating batching and model quantization to ensure high-throughput inference on resource-limited edge devices. Then, based on the architecture of transformer decoder-based LLMs, we formulate an NP-hard edge inference optimization problem that considers batch scheduling and the joint allocation of communication and computation resources, whose solution is the optimal throughput under edge resource constraints and heterogeneous user requirements on latency and accuracy. To solve this NP-hard problem, we develop OT-GAH (Optimal Tree-search with Generalized Assignment Heuristics), an algorithm with reasonable complexity and a 1/2-approximation ratio. We first design the OT algorithm with online tree-pruning for the single-edge-node multiuser case, which navigates inference request selection within a tree structure to maximize throughput. We then consider the multi-edge-node case and propose the GAH algorithm, which recursively invokes OT in each node's inference scheduling iteration. Simulation results demonstrate the superiority of OT-GAH batching over other benchmarks, revealing an over 45% time complexity reduction compared to brute-force searching.
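The abstract describes OT-GAH only at a high level. The following Python sketch illustrates the two ingredients it names: a pruned tree search over candidate request batches on a single edge node, and a greedy outer loop that invokes that search once per node. Everything here is a hypothetical simplification for illustration; the `Request` fields, the additive latency model, and the scalar per-node `budget` are stand-ins, not the paper's actual formulation or its joint communication/computation constraints.

```python
from dataclasses import dataclass

@dataclass
class Request:
    tokens: int      # output tokens requested; stands in for throughput gain
    latency: float   # marginal latency this request adds to a batch (assumed additive)
    deadline: float  # user's end-to-end latency requirement

def best_batch(requests: list, budget: float):
    """Toy OT-style branch-and-bound: search the binary include/skip tree of
    requests for the token-maximizing batch that fits the node's latency
    budget and the tightest deadline among its chosen requests."""
    best_tokens, best_set = 0, []

    def dfs(i, chosen, tokens, latency, min_dl):
        nonlocal best_tokens, best_set
        if tokens > best_tokens:
            best_tokens, best_set = tokens, list(chosen)
        if i == len(requests):
            return
        # Online pruning: optimistic bound assuming all remaining requests fit.
        if tokens + sum(r.tokens for r in requests[i:]) <= best_tokens:
            return
        r = requests[i]
        t = latency + r.latency
        dl = min(min_dl, r.deadline)
        if t <= budget and t <= dl:              # branch 1: include request i
            chosen.append(r)
            dfs(i + 1, chosen, tokens + r.tokens, t, dl)
            chosen.pop()
        dfs(i + 1, chosen, tokens, latency, min_dl)  # branch 2: skip request i

    dfs(0, [], 0, 0.0, float("inf"))
    return best_tokens, best_set

def greedy_assign(requests: list, node_budgets: list):
    """Toy GAH-style outer loop: visit edge nodes in turn and re-run the tree
    search on the still-unassigned requests, mimicking the per-node
    scheduling iterations described in the abstract."""
    remaining, plan = list(requests), {}
    for node, budget in enumerate(node_budgets):
        _, batch = best_batch(remaining, budget)
        plan[node] = batch
        picked = {id(r) for r in batch}
        remaining = [r for r in remaining if id(r) not in picked]
    return plan, remaining   # leftovers would be deferred or rejected

if __name__ == "__main__":
    reqs = [Request(120, 0.4, 1.0), Request(80, 0.3, 0.6),
            Request(200, 0.7, 2.0), Request(60, 0.2, 0.5)]
    plan, leftover = greedy_assign(reqs, node_budgets=[1.0, 0.5])
    print(plan, leftover)
```

The optimistic bound makes the pruning safe in this toy: a subtree is skipped only when even admitting every remaining request could not beat the incumbent, which is the same flavor of online tree-pruning the abstract attributes to OT, though the paper's bound and approximation analysis are of course its own.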

Bibliographic details

Published in: IEEE Transactions on Wireless Communications, 2024-11, pp. 1-1
Authors: Zhang, Xinyuan; Nie, Jiangtian; Huang, Yudong; Xie, Gaochang; Xiong, Zehui; Liu, Jiang; Niyato, Dusit; Shen, Xuemin Sherman
Format: Article
Language: English
Subjects: edge inference; Generative AI; multiuser edge computing; wireless networks
DOI: 10.1109/TWC.2024.3497923
ISSN: 1536-1276
EISSN: 1558-2248
Source: IEEE Electronic Library (IEL)
Online access: Order full text