PipeLLM: Fast and Confidential Large Language Model Services with Speculative Pipelined Encryption

Confidential computing on GPUs, like NVIDIA H100, mitigates the security risks of outsourced Large Language Models (LLMs) by implementing strong isolation and data encryption. Nonetheless, this encryption incurs a significant performance overhead, reaching up to 52.8 percent and 88.2 percent throughput drop when serving OPT-30B and OPT-66B, respectively.

Detailed Description

Saved in:
Bibliographic Details
Main Authors: Tan, Yifan; Tan, Cheng; Mi, Zeyu; Chen, Haibo
Format: Article
Language: eng
Subjects:
Online Access: Order full text
creator Tan, Yifan
Tan, Cheng
Mi, Zeyu
Chen, Haibo
description Confidential computing on GPUs, like NVIDIA H100, mitigates the security risks of outsourced Large Language Models (LLMs) by implementing strong isolation and data encryption. Nonetheless, this encryption incurs a significant performance overhead, reaching up to 52.8 percent and 88.2 percent throughput drop when serving OPT-30B and OPT-66B, respectively. To address this challenge, we introduce PipeLLM, a user-transparent runtime system. PipeLLM removes the overhead by overlapping the encryption and GPU computation through pipelining - an idea inspired by the CPU instruction pipelining - thereby effectively concealing the latency increase caused by encryption. The primary technical challenge is that, unlike CPUs, the encryption module lacks prior knowledge of the specific data needing encryption until it is requested by the GPUs. To this end, we propose speculative pipelined encryption to predict the data requiring encryption by analyzing the serving patterns of LLMs. Further, we have developed an efficient, low-cost pipeline relinquishing approach for instances of incorrect predictions. Our experiments on NVIDIA H100 GPU show that compared with vanilla systems without confidential computing (e.g., vLLM, PEFT, and FlexGen), PipeLLM incurs modest overhead (less than 19.6 percent in throughput) across various LLM sizes, from 13B to 175B.
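The description above outlines the paper's core mechanism: speculative pipelined encryption that overlaps encryption with GPU work, plus cheap pipeline relinquishing on mispredictions. Below is a minimal sketch of that idea, not code from the paper; the class name, the sequential prediction policy, the queue depth, and the hash-based stand-in for the real cipher are all illustrative assumptions.

```python
# Minimal sketch of speculative pipelined encryption (illustrative only).
# The prediction policy (sequential blocks) and the hash-based stand-in
# for the real cipher are assumptions for this example.
import hashlib
import queue
import threading
import time


def encrypt(block: bytes) -> bytes:
    # Stand-in for the real per-block cipher (e.g., AES-GCM over bounce buffers).
    return hashlib.sha256(block).digest() + block


class SpeculativeEncryptor:
    """Encrypts predicted blocks in the background so the latency overlaps GPU work."""

    def __init__(self, blocks, depth=2):
        self.blocks = blocks          # data the GPU may request, as a list of bytes
        self.depth = depth            # how many blocks to encrypt ahead of demand
        self.lock = threading.Lock()
        self.epoch = 0                # bumped on misprediction to invalidate stale work
        self.next_idx = 0             # next block the worker will speculatively encrypt
        self.ready = queue.Queue()    # (epoch, index, ciphertext) produced ahead of time
        self.stop = threading.Event()
        threading.Thread(target=self._worker, daemon=True).start()

    def _worker(self):
        # Speculation: assume blocks are requested sequentially, matching the largely
        # predictable serving patterns of LLMs described in the abstract.
        while not self.stop.is_set():
            with self.lock:
                can_run = (self.ready.qsize() < self.depth
                           and self.next_idx < len(self.blocks))
                if can_run:
                    idx, epoch = self.next_idx, self.epoch
                    self.next_idx += 1
            if not can_run:
                time.sleep(0.0001)    # pipeline full or drained; wait briefly
                continue
            # This encryption runs while the GPU is busy with earlier blocks.
            self.ready.put((epoch, idx, encrypt(self.blocks[idx])))

    def fetch(self, index):
        """Called when the GPU actually requests block `index`."""
        while True:
            try:
                epoch, idx, ct = self.ready.get_nowait()
            except queue.Empty:
                break
            if epoch == self.epoch and idx == index:
                return ct             # prediction hit: encryption latency was hidden
            # Stale or mispredicted item: discard and keep scanning.
        # Miss: relinquish speculative state and encrypt on demand.
        with self.lock:
            self.epoch += 1
            self.next_idx = index + 1  # resume speculation after the requested block
        return encrypt(self.blocks[index])
```

As a usage example, `SpeculativeEncryptor([b"w0", b"w1", b"w2"]).fetch(0)` returns the ciphertext of the first block, typically from the speculative queue once the worker has warmed up. The epoch counter is one simple way to express the pipeline relinquishing described above: bumping it invalidates every in-flight speculative ciphertext at once, so a misprediction costs only a single on-demand encryption.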
doi_str_mv 10.48550/arxiv.2411.03357
format Article
fulltext fulltext_linktorsrc
identifier DOI: 10.48550/arxiv.2411.03357
language eng
recordid cdi_arxiv_primary_2411_03357
source arXiv.org
subjects Computer Science - Cryptography and Security
Computer Science - Distributed, Parallel, and Cluster Computing
title PipeLLM: Fast and Confidential Large Language Model Services with Speculative Pipelined Encryption