Communication Compression for Tensor Parallel LLM Inference
Large Language Models (LLMs) have pushed the frontier of artificial intelligence but comprise hundreds of billions of parameters and operations. For faster inference latency, LLMs are deployed on multiple hardware accelerators through various Model Parallelism strategies. Our paper looks into the details of one such strategy - Tensor Parallel - and proposes to reduce latency by compressing inter-accelerator communication. We leverage fine-grained quantization techniques to compress selected activations by 3.5-4.5x. Our proposed method leads to up to a 2x reduction in time-to-first-token (TTFT) with negligible model performance degradation.
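The abstract describes the technique only at a high level, so below is a minimal sketch of what fine-grained activation compression around a tensor-parallel all-reduce could look like. It is not the authors' implementation: the int8 format, absmax scaling, the group size of 128, and the helper names quantize_groupwise / dequantize_groupwise are all illustrative assumptions.

```python
import numpy as np

def quantize_groupwise(x, group_size=128):
    """Quantize a float32 activation tensor to int8 with one scale per group.

    Per-group absmax scales are the "fine-grained" part: each group of 128
    values gets its own scale, so a single outlier only degrades its own group.
    """
    orig_shape = x.shape
    groups = x.reshape(-1, group_size)                        # (num_groups, group_size)
    scales = np.abs(groups).max(axis=1, keepdims=True) / 127.0
    scales = np.where(scales == 0, 1.0, scales)               # avoid division by zero
    q = np.clip(np.round(groups / scales), -127, 127).astype(np.int8)
    return q, scales.astype(np.float32), orig_shape

def dequantize_groupwise(q, scales, orig_shape):
    """Recover an approximate float32 tensor from int8 values and per-group scales."""
    return (q.astype(np.float32) * scales).reshape(orig_shape)

# Toy stand-in for the tensor-parallel all-reduce: each of the 4 "ranks" would
# normally send its float32 partial activations; here each sends int8 values
# plus per-group scales, and the receiver dequantizes before the reduction.
rank_partials = [np.random.randn(4, 1024).astype(np.float32) for _ in range(4)]

received = []
for partial in rank_partials:
    q, scales, shape = quantize_groupwise(partial)            # compress before "sending"
    received.append(dequantize_groupwise(q, scales, shape))   # decompress on arrival

reduced = np.sum(received, axis=0)                            # the all-reduce sum
exact = np.sum(rank_partials, axis=0)
print("max abs error vs. uncompressed all-reduce:", np.abs(reduced - exact).max())
```

Under these assumptions (fp32 activations, int8 payload, one fp32 scale per group of 128 values), each message shrinks by roughly 3.9x, which sits inside the 3.5-4.5x range the abstract reports. In a real deployment the compressed tensors would travel over the collective communication library and be dequantized on the receiving accelerator before the reduce.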
Saved in:
Published in: | arXiv.org, 2024-11 |
---|---|
Main authors: | Hansen-Palmus, Jan; Michael Truong Le; Hausdörfer, Oliver; Verma, Alok |
Format: | Article |
Language: | eng |
Subjects: | Artificial intelligence; Compressing; Inference; Large language models; Performance degradation; Tensors |
Online access: | Full text |
container_end_page | |
---|---|
container_issue | |
container_start_page | |
container_title | arXiv.org |
container_volume | |
creator | Hansen-Palmus, Jan; Michael Truong Le; Hausdörfer, Oliver; Verma, Alok |
description | Large Language Models (LLMs) have pushed the frontier of artificial intelligence but comprise hundreds of billions of parameters and operations. For faster inference latency, LLMs are deployed on multiple hardware accelerators through various Model Parallelism strategies. Our paper looks into the details of one such strategy - Tensor Parallel - and proposes to reduce latency by compressing inter-accelerator communication. We leverage fine-grained quantization techniques to compress selected activations by 3.5-4.5x. Our proposed method leads to up to a 2x reduction in time-to-first-token (TTFT) with negligible model performance degradation. |
format | Article |
fulltext | fulltext |
identifier | EISSN: 2331-8422 |
ispartof | arXiv.org, 2024-11 |
issn | 2331-8422 |
language | eng |
recordid | cdi_proquest_journals_3128887123 |
source | Free E-Journals |
subjects | Artificial intelligence; Compressing; Inference; Large language models; Performance degradation; Tensors |
title | Communication Compression for Tensor Parallel LLM Inference |
url | https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2024-12-25T11%3A35%3A54IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest&rft_val_fmt=info:ofi/fmt:kev:mtx:book&rft.genre=document&rft.atitle=Communication%20Compression%20for%20Tensor%20Parallel%20LLM%20Inference&rft.jtitle=arXiv.org&rft.au=Hansen-Palmus,%20Jan&rft.date=2024-11-15&rft.eissn=2331-8422&rft_id=info:doi/&rft_dat=%3Cproquest%3E3128887123%3C/proquest%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=3128887123&rft_id=info:pmid/&rfr_iscdi=true |