PipeLLM: Fast and Confidential Large Language Model Services with Speculative Pipelined Encryption

Confidential computing on GPUs, like NVIDIA H100, mitigates the security risks of outsourced Large Language Models (LLMs) by implementing strong isolation and data encryption. Nonetheless, this encryption incurs a significant performance overhead, reaching up to 52.8 percent and 88.2 percent throughput drop when serving OPT-30B and OPT-66B, respectively.

Detailed Description

Saved in:
Bibliographic Details
Main Authors: Tan, Yifan; Tan, Cheng; Mi, Zeyu; Chen, Haibo
Format: Article
Language: eng
Subjects:
Online Access: Order full text
creator Tan, Yifan
Tan, Cheng
Mi, Zeyu
Chen, Haibo
description Confidential computing on GPUs, like NVIDIA H100, mitigates the security risks of outsourced Large Language Models (LLMs) by implementing strong isolation and data encryption. Nonetheless, this encryption incurs a significant performance overhead, reaching up to 52.8 percent and 88.2 percent throughput drop when serving OPT-30B and OPT-66B, respectively. To address this challenge, we introduce PipeLLM, a user-transparent runtime system. PipeLLM removes the overhead by overlapping the encryption and GPU computation through pipelining - an idea inspired by the CPU instruction pipelining - thereby effectively concealing the latency increase caused by encryption. The primary technical challenge is that, unlike CPUs, the encryption module lacks prior knowledge of the specific data needing encryption until it is requested by the GPUs. To this end, we propose speculative pipelined encryption to predict the data requiring encryption by analyzing the serving patterns of LLMs. Further, we have developed an efficient, low-cost pipeline relinquishing approach for instances of incorrect predictions. Our experiments on NVIDIA H100 GPU show that compared with vanilla systems without confidential computing (e.g., vLLM, PEFT, and FlexGen), PipeLLM incurs modest overhead (less than 19.6 percent in throughput) across various LLM sizes, from 13B to 175B.
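The description above outlines the paper's core mechanism: speculative pipelined encryption that overlaps encryption with GPU work, plus cheap pipeline relinquishing on mispredictions. Below is a minimal sketch of that idea, not code from the paper; the class name, the sequential prediction policy, the queue depth, and the hash-based stand-in for the real cipher are all illustrative assumptions.

```python
# Minimal sketch of speculative pipelined encryption (illustrative only).
# The prediction policy (sequential blocks) and the hash-based stand-in
# for the real cipher are assumptions for this example.
import hashlib
import queue
import threading
import time


def encrypt(block: bytes) -> bytes:
    # Stand-in for the real per-block cipher (e.g., AES-GCM over bounce buffers).
    return hashlib.sha256(block).digest() + block


class SpeculativeEncryptor:
    """Encrypts predicted blocks in the background so the latency overlaps GPU work."""

    def __init__(self, blocks, depth=2):
        self.blocks = blocks          # data the GPU may request, as a list of bytes
        self.depth = depth            # how many blocks to encrypt ahead of demand
        self.lock = threading.Lock()
        self.epoch = 0                # bumped on misprediction to invalidate stale work
        self.next_idx = 0             # next block the worker will speculatively encrypt
        self.ready = queue.Queue()    # (epoch, index, ciphertext) produced ahead of time
        self.stop = threading.Event()
        threading.Thread(target=self._worker, daemon=True).start()

    def _worker(self):
        # Speculation: assume blocks are requested sequentially, matching the largely
        # predictable serving patterns of LLMs described in the abstract.
        while not self.stop.is_set():
            with self.lock:
                can_run = (self.ready.qsize() < self.depth
                           and self.next_idx < len(self.blocks))
                if can_run:
                    idx, epoch = self.next_idx, self.epoch
                    self.next_idx += 1
            if not can_run:
                time.sleep(0.0001)    # pipeline full or drained; wait briefly
                continue
            # This encryption runs while the GPU is busy with earlier blocks.
            self.ready.put((epoch, idx, encrypt(self.blocks[idx])))

    def fetch(self, index):
        """Called when the GPU actually requests block `index`."""
        while True:
            try:
                epoch, idx, ct = self.ready.get_nowait()
            except queue.Empty:
                break
            if epoch == self.epoch and idx == index:
                return ct             # prediction hit: encryption latency was hidden
            # Stale or mispredicted item: discard and keep scanning.
        # Miss: relinquish speculative state and encrypt on demand.
        with self.lock:
            self.epoch += 1
            self.next_idx = index + 1  # resume speculation after the requested block
        return encrypt(self.blocks[index])
```

As a usage example, `SpeculativeEncryptor([b"w0", b"w1", b"w2"]).fetch(0)` returns the ciphertext of the first block, typically from the speculative queue once the worker has warmed up. The epoch counter is one simple way to express the pipeline relinquishing described above: bumping it invalidates every in-flight speculative ciphertext at once, so a misprediction costs only a single on-demand encryption.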
doi_str_mv 10.48550/arxiv.2411.03357
format Article
fulltext fulltext_linktorsrc
identifier DOI: 10.48550/arxiv.2411.03357
language eng
recordid cdi_arxiv_primary_2411_03357
source arXiv.org
subjects Computer Science - Cryptography and Security
Computer Science - Distributed, Parallel, and Cluster Computing
title PipeLLM: Fast and Confidential Large Language Model Services with Speculative Pipelined Encryption