Communication Compression for Tensor Parallel LLM Inference

Large Language Models (LLMs) have pushed the frontier of artificial intelligence but are composed of hundreds of billions of parameters and operations. For faster inference latency, LLMs are deployed on multiple hardware accelerators through various Model Parallelism strategies. Our paper looks into the details of one such strategy, Tensor Parallel, and proposes to reduce latency by compressing inter-accelerator communication. We leverage fine-grained quantization techniques to compress selected activations by 3.5-4.5x. Our proposed method leads to up to a 2x reduction in time-to-first-token (TTFT) with negligible model performance degradation.
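
The abstract gives only a high-level description of the compression scheme. As a rough, non-authoritative illustration, the sketch below shows one way fine-grained (group-wise) activation quantization could shrink a tensor before it is exchanged between accelerators in a tensor-parallel all-gather or all-reduce. The 4-bit width, the group size of 128, and the fp16 per-group scales are assumptions chosen so the logical compression ratio lands near the 3.5-4.5x range reported in the abstract; they are not the paper's actual configuration.

```python
# Illustrative sketch only (PyTorch): group-wise ("fine-grained") quantization of an
# activation tensor, of the kind that could be applied to the partial activations
# exchanged between accelerators under tensor parallelism. Group size, 4-bit width,
# and fp16 scales are assumptions, not the paper's configuration.
import torch

GROUP_SIZE = 128  # assumed number of elements sharing one scale


def quantize_groupwise(x: torch.Tensor):
    """Quantize a 1-D tensor to 4-bit integers with one fp16 scale per group."""
    assert x.numel() % GROUP_SIZE == 0
    groups = x.float().view(-1, GROUP_SIZE)                # (num_groups, GROUP_SIZE)
    scales = groups.abs().amax(dim=1, keepdim=True) / 7.0  # symmetric int4 range [-7, 7]
    scales = torch.clamp(scales, min=1e-8)
    q = torch.clamp(torch.round(groups / scales), -7, 7).to(torch.int8)  # int4 values, int8 storage
    return q, scales.half()


def dequantize_groupwise(q: torch.Tensor, scales: torch.Tensor) -> torch.Tensor:
    """Reconstruct approximate activations from quantized values and per-group scales."""
    return (q.float() * scales.float()).view(-1).half()


if __name__ == "__main__":
    act = torch.randn(4096, dtype=torch.float16)  # stand-in for a partial activation shard
    q, scales = quantize_groupwise(act)

    # In a real tensor-parallel setup, `q` and `scales` would be communicated instead of
    # the full fp16 tensor, then dequantized (and summed) on the receiving accelerators.
    recon = dequantize_groupwise(q, scales)

    orig_bits = act.numel() * 16                     # fp16 payload
    comp_bits = q.numel() * 4 + scales.numel() * 16  # logical int4 payload + fp16 scales
    # (the int4 values still occupy int8 storage here; a real kernel would pack
    #  two values per byte to realize the 4-bit footprint on the wire)
    print(f"logical compression ratio: {orig_bits / comp_bits:.2f}x")
    print(f"max abs reconstruction error: {(act - recon).abs().max().item():.4f}")
```

Group-wise scales keep the quantization error local to each small block of activations, which is generally what makes aggressive low-bit formats tolerable; a coarser per-tensor scale would usually need more bits for comparable accuracy.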

Bibliographic Details
Published in: arXiv.org, 2024-11
Main Authors: Hansen-Palmus, Jan; Michael Truong Le; Hausdörfer, Oliver; Verma, Alok
Format: Article
Language: English
EISSN: 2331-8422
Subjects: Artificial intelligence; Compressing; Inference; Large language models; Performance degradation; Tensors
Online Access: Full text