Beyond the Cloud: Edge Inference for Generative Large Language Models in Wireless Networks
Generative Artificial Intelligence (GAI) is revolutionizing the world with its unprecedented content creation ability. The Large Language Model (LLM) is one of its most embraced branches. However, due to LLMs' substantial size and resource-intensive nature, they are typically cloud-hosted, raising concerns about privacy, usage limitations, and latency. In this paper, we propose to utilize ubiquitous distributed wireless edge computing resources for real-time LLM inference. Specifically, we introduce a novel LLM edge inference framework, incorporating batching and model quantization to ensure high-throughput inference on resource-limited edge devices. Then, based on the architecture of transformer decoder-based LLMs, we formulate an edge inference optimization problem, which is NP-hard, considering batch scheduling and the joint allocation of communication and computation resources. Its solution is the optimal throughput under edge resource constraints and heterogeneous user requirements on latency and accuracy. To solve this NP-hard problem, we develop an OT-GAH (Optimal Tree-search with Generalized Assignment Heuristics) algorithm with reasonable complexity and a 1/2-approximation ratio. We first design the OT algorithm with online tree-pruning for the single-edge-node multiuser case, which navigates inference request selection within a tree structure to maximize throughput. We then consider the multi-edge-node case and propose the GAH algorithm, which recursively invokes OT in each node's inference scheduling iteration. Simulation results demonstrate the superiority of OT-GAH batching over other benchmarks, revealing an over 45% time-complexity reduction compared to brute-force searching.
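The abstract motivates model quantization by edge memory limits, and the back-of-envelope arithmetic is simple: weight storage scales linearly with bit width, so at FP16 a model needs about 2 bytes per parameter and at INT4 about 0.5 bytes, shrinking a 7-billion-parameter model from roughly 14 GB to 3.5 GB of weights. This is a generic illustration, not a figure from the paper. A minimal sketch:

```python
def weight_memory_gb(params_billions: float, bits_per_weight: int) -> float:
    """Approximate weight-only memory footprint; KV cache and
    activations, which also matter at the edge, are excluded."""
    return params_billions * 1e9 * bits_per_weight / 8 / 1e9

print(weight_memory_gb(7, 16))  # FP16: ~14.0 GB
print(weight_memory_gb(7, 4))   # INT4:  ~3.5 GB
```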
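The paper's OT algorithm is only named in the abstract, so the following Python sketch is an illustrative reconstruction of the general idea rather than the authors' method: candidate batches are explored as a binary include/exclude tree over requests, and a branch is pruned online whenever an optimistic throughput bound cannot beat the best batch found so far. The `Request` type, `latency_budget` field, and the simplified latency model are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Request:
    tokens: int            # expected output tokens to generate
    latency_budget: float  # user's deadline in seconds

def batch_latency(batch: List[Request], tok_rate: float) -> float:
    # Simplified model: each decoding step emits one token per request,
    # so per-request latency grows with both batch size and length.
    return max(r.tokens for r in batch) * len(batch) / tok_rate

def best_batch(reqs: List[Request], tok_rate: float) -> Tuple[int, List[Request]]:
    """Include/exclude tree search with optimistic-bound pruning."""
    best = (0, [])

    def dfs(i: int, chosen: List[Request], throughput: int) -> None:
        nonlocal best
        # Optimistic bound: pretend every remaining request is feasible.
        bound = throughput + sum(r.tokens for r in reqs[i:])
        if bound <= best[0]:
            return  # prune this subtree
        if i == len(reqs):
            if throughput > best[0]:
                best = (throughput, list(chosen))
            return
        # Branch 1: include request i if every deadline still holds.
        cand = chosen + [reqs[i]]
        if all(batch_latency(cand, tok_rate) <= r.latency_budget for r in cand):
            dfs(i + 1, cand, throughput + reqs[i].tokens)
        # Branch 2: exclude request i.
        dfs(i + 1, chosen, throughput)

    dfs(0, [], 0)
    return best
```

The pruning bound here is deliberately loose; tightening it is exactly where a tree-search method earns its complexity savings over brute-force enumeration.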
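For the multi-edge-node case, the abstract states only that GAH recursively invokes OT in each node's scheduling iteration. A greedy generalized-assignment-style pass captures the flavor; again, this is a hypothetical sketch layered on `best_batch` above, not the paper's GAH.

```python
def schedule_multi_node(reqs, nodes):
    """nodes: list of (node_id, tok_rate). Each node in turn runs the
    single-node batch search over the still-unassigned requests."""
    remaining = list(reqs)
    plan = {}
    for node_id, tok_rate in nodes:
        _, batch = best_batch(remaining, tok_rate)
        plan[node_id] = batch
        chosen = {id(r) for r in batch}  # identity-based removal
        remaining = [r for r in remaining if id(r) not in chosen]
    return plan

# Example: three requests scheduled across two heterogeneous edge nodes.
reqs = [Request(tokens=64, latency_budget=2.0),
        Request(tokens=128, latency_budget=5.0),
        Request(tokens=32, latency_budget=1.0)]
print(schedule_multi_node(reqs, [("edge-1", 200.0), ("edge-2", 120.0)]))
```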
Published in: | IEEE transactions on wireless communications, 2024-11, p.1-1 |
---|---|
Main authors: | Zhang, Xinyuan; Nie, Jiangtian; Huang, Yudong; Xie, Gaochang; Xiong, Zehui; Liu, Jiang; Niyato, Dusit; Shen, Xuemin Sherman |
Format: | Article |
Language: | English |
Subjects: | edge inference; Generative AI; multiuser edge computing; wireless networks |
DOI: | 10.1109/TWC.2024.3497923 |
ISSN: | 1536-1276 |
EISSN: | 1558-2248 |
Source: | IEEE Electronic Library (IEL) |