Inference without Interference: Disaggregate LLM Inference for Mixed Downstream Workloads

Transformer-based large language model (LLM) inference serving is now the backbone of many cloud services. LLM inference consists of a prefill phase and a decode phase. However, existing LLM deployment practices often overlook the distinct characteristics of these phases, leading to significant interference. To mitigate interference, our insight is to carefully schedule and group inference requests based on their characteristics. We realize this idea in TetriInfer through three pillars. First, it partitions prompts into fixed-size chunks so that the accelerator always runs close to its computation-saturated limit. Second, it disaggregates prefill and decode instances so each can run independently. Finally, it uses a smart two-level scheduling algorithm augmented with predicted resource usage to avoid decode scheduling hotspots. Results show that TetriInfer improves time-to-first-token (TTFT), job completion time (JCT), and inference efficiency in terms of performance per dollar by a large margin; e.g., it uses 38% fewer resources while lowering average TTFT and average JCT by 97% and 47%, respectively.
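As a rough illustration of the three pillars named in the abstract (fixed-size prefill chunking, prefill/decode disaggregation, and prediction-guided decode scheduling), the sketch below mocks the idea in plain Python. All names (chunk_prompt, DecodeInstance, pick_decode_instance) and the chunk size are illustrative assumptions, not TetriInfer's actual API.

```python
from dataclasses import dataclass
from typing import List

CHUNK_SIZE = 512  # assumed fixed chunk size; the paper tunes this to saturate the accelerator


def chunk_prompt(prompt_tokens: List[int], chunk_size: int = CHUNK_SIZE) -> List[List[int]]:
    """Split a prompt into fixed-size chunks so each prefill step runs
    near the accelerator's computation-saturated limit."""
    return [prompt_tokens[i:i + chunk_size]
            for i in range(0, len(prompt_tokens), chunk_size)]


@dataclass
class DecodeInstance:
    """A decode-only instance, disaggregated from the prefill instances."""
    name: str
    predicted_load: int = 0  # sum of predicted output lengths of queued requests

    def admit(self, predicted_output_len: int) -> None:
        self.predicted_load += predicted_output_len


def pick_decode_instance(instances: List[DecodeInstance],
                         predicted_output_len: int) -> DecodeInstance:
    """Decode-side scheduling step: route the request to the decode instance
    with the lowest predicted load to avoid hotspots."""
    target = min(instances, key=lambda inst: inst.predicted_load)
    target.admit(predicted_output_len)
    return target


if __name__ == "__main__":
    prompt = list(range(1300))            # a fake 1300-token prompt
    chunks = chunk_prompt(prompt)
    print([len(c) for c in chunks])       # -> [512, 512, 276]

    decoders = [DecodeInstance("decode-0"), DecodeInstance("decode-1")]
    # Pretend a length predictor estimated 200 output tokens for this request.
    chosen = pick_decode_instance(decoders, predicted_output_len=200)
    print(chosen.name, chosen.predicted_load)
```

A real deployment would run prefill and decode on separate accelerator pools and hand over per-request state between them; this toy version only models the chunking and scheduling decisions described in the abstract.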

Bibliographic Details

Main Authors: Hu, Cunchen; Huang, Heyang; Xu, Liangliang; Chen, Xusheng; Xu, Jiang; Chen, Shuang; Feng, Hao; Wang, Chenxi; Wang, Sa; Bao, Yungang; Sun, Ninghui; Shan, Yizhou
Format: Article
Language: English
Subjects: Computer Science - Distributed, Parallel, and Cluster Computing
Published: 2024-01-20
DOI: 10.48550/arxiv.2401.11181
Source: arXiv.org
Online Access: https://arxiv.org/abs/2401.11181