Communication Compression for Tensor Parallel LLM Inference
Large Language Models (LLMs) have pushed the frontier of artificial intelligence but comprise hundreds of billions of parameters and operations. For faster inference latency, LLMs are deployed on multiple hardware accelerators through various Model Parallelism strategies. Our paper looks into the details of one such strategy - Tensor Parallel - and proposes to reduce latency by compressing inter-accelerator communication. We leverage fine-grained quantization techniques to compress selected activations by 3.5-4.5x. Our proposed method leads to up to a 2x reduction in time-to-first-token (TTFT) with negligible model performance degradation.
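The abstract describes the technique only at a high level, so below is a minimal sketch of what fine-grained activation compression around a tensor-parallel all-reduce could look like. It is not the authors' implementation: the int8 format, absmax scaling, the group size of 128, and the helper names quantize_groupwise / dequantize_groupwise are all illustrative assumptions.

```python
import numpy as np

def quantize_groupwise(x, group_size=128):
    """Quantize a float32 activation tensor to int8 with one scale per group.

    Per-group absmax scales are the "fine-grained" part: each group of 128
    values gets its own scale, so a single outlier only degrades its own group.
    """
    orig_shape = x.shape
    groups = x.reshape(-1, group_size)                        # (num_groups, group_size)
    scales = np.abs(groups).max(axis=1, keepdims=True) / 127.0
    scales = np.where(scales == 0, 1.0, scales)               # avoid division by zero
    q = np.clip(np.round(groups / scales), -127, 127).astype(np.int8)
    return q, scales.astype(np.float32), orig_shape

def dequantize_groupwise(q, scales, orig_shape):
    """Recover an approximate float32 tensor from int8 values and per-group scales."""
    return (q.astype(np.float32) * scales).reshape(orig_shape)

# Toy stand-in for the tensor-parallel all-reduce: each of the 4 "ranks" would
# normally send its float32 partial activations; here each sends int8 values
# plus per-group scales, and the receiver dequantizes before the reduction.
rank_partials = [np.random.randn(4, 1024).astype(np.float32) for _ in range(4)]

received = []
for partial in rank_partials:
    q, scales, shape = quantize_groupwise(partial)            # compress before "sending"
    received.append(dequantize_groupwise(q, scales, shape))   # decompress on arrival

reduced = np.sum(received, axis=0)                            # the all-reduce sum
exact = np.sum(rank_partials, axis=0)
print("max abs error vs. uncompressed all-reduce:", np.abs(reduced - exact).max())
```

Under these assumptions (fp32 activations, int8 payload, one fp32 scale per group of 128 values), each message shrinks by roughly 3.9x, which sits inside the 3.5-4.5x range the abstract reports. In a real deployment the compressed tensors would travel over the collective communication library and be dequantized on the receiving accelerator before the reduce.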
Saved in:
Published in: | arXiv.org, 2024-11 |
---|---|
Main authors: | Hansen-Palmus, Jan; Michael Truong Le; Hausdörfer, Oliver; Verma, Alok |
Format: | Article |
Language: | eng |
Subjects: | Artificial intelligence; Compressing; Inference; Large language models; Performance degradation; Tensors |
Online access: | Full text |
container_end_page | |
---|---|
container_issue | |
container_start_page | |
container_title | arXiv.org |
container_volume | |
creator | Hansen-Palmus, Jan; Michael Truong Le; Hausdörfer, Oliver; Verma, Alok |
description | Large Language Models (LLMs) have pushed the frontier of artificial intelligence but comprise hundreds of billions of parameters and operations. For faster inference latency, LLMs are deployed on multiple hardware accelerators through various Model Parallelism strategies. Our paper looks into the details of one such strategy - Tensor Parallel - and proposes to reduce latency by compressing inter-accelerator communication. We leverage fine-grained quantization techniques to compress selected activations by 3.5-4.5x. Our proposed method leads to up to a 2x reduction in time-to-first-token (TTFT) with negligible model performance degradation. |
format | Article |
fulltext | fulltext |
identifier | EISSN: 2331-8422 |
ispartof | arXiv.org, 2024-11 |
issn | 2331-8422 |
language | eng |
recordid | cdi_proquest_journals_3128887123 |
source | Free E-Journals |
subjects | Artificial intelligence; Compressing; Inference; Large language models; Performance degradation; Tensors |
title | Communication Compression for Tensor Parallel LLM Inference |
url | https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2024-12-25T11%3A35%3A54IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest&rft_val_fmt=info:ofi/fmt:kev:mtx:book&rft.genre=document&rft.atitle=Communication%20Compression%20for%20Tensor%20Parallel%20LLM%20Inference&rft.jtitle=arXiv.org&rft.au=Hansen-Palmus,%20Jan&rft.date=2024-11-15&rft.eissn=2331-8422&rft_id=info:doi/&rft_dat=%3Cproquest%3E3128887123%3C/proquest%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=3128887123&rft_id=info:pmid/&rfr_iscdi=true |