PackMamba: Efficient Processing of Variable-Length Sequences in Mamba training

With the evolution of large language models, traditional Transformer models become computationally demanding for lengthy sequences due to the quadratic growth in computation with respect to the sequence length. Mamba, emerging as a groundbreaking architecture in the field of generative AI, demonstrates remarkable proficiency in handling elongated sequences with reduced computational and memory complexity. Nevertheless, the existing training framework of Mamba is inefficient with variable-length sequence inputs: training on single sequences results in low GPU utilization, while padding variable-length sequences in a batch to the maximum length incurs considerable memory and computational overhead. To address this problem, we analyze the performance of bottleneck operators in Mamba under diverse tensor shapes and propose PackMamba, a high-throughput Mamba that efficiently handles variable-length sequences. Diving deep into state-space models (SSMs), we modify the parallel operators to avoid passing information between individual sequences while maintaining high performance. Experimental results on an NVIDIA A100 GPU demonstrate throughput exceeding the baseline single-sequence processing scheme: 3.06x speedup on the 1.4B model and 2.62x on the 2.8B model.
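The abstract's key constraint is that when variable-length sequences are packed into one long input, the SSM scan must not carry hidden state across sequence boundaries. The sketch below is only a rough illustration of that idea, not the paper's implementation: it runs a simplified diagonal linear-SSM recurrence (omitting Mamba's input-dependent "selective" parameters) over a packed input and resets the state wherever a new sequence begins. All function and variable names (`packed_ssm_scan`, `seq_lens`, etc.) are hypothetical.

```python
# Illustrative sketch only: a sequential, diagonal linear-SSM-style recurrence
# over a "packed" batch, where several variable-length sequences are
# concatenated along the time axis and the hidden state is reset at every
# sequence boundary so no information leaks between sequences.
import torch


def packed_ssm_scan(x, A, B, C, seq_lens):
    """x: (T, D) packed inputs; A, B, C: (D, N) per-channel diagonal SSM
    parameters; seq_lens: lengths of the packed sequences (summing to T)."""
    T, D = x.shape
    N = A.shape[-1]
    # Mark the first step of every packed sequence -> reset the state there.
    starts = torch.zeros(T, dtype=torch.bool)
    offsets = torch.cumsum(torch.tensor([0] + list(seq_lens[:-1])), dim=0)
    starts[offsets] = True

    h = torch.zeros(D, N)
    ys = []
    for t in range(T):
        if starts[t]:
            h = torch.zeros(D, N)           # boundary: do not carry state over
        h = A * h + B * x[t].unsqueeze(-1)  # h_t = A * h_{t-1} + B * x_t
        ys.append((h * C).sum(-1))          # y_t = sum over the state dimension
    return torch.stack(ys)                  # (T, D)


if __name__ == "__main__":
    D, N = 4, 8
    seq_lens = [5, 3, 7]                    # three variable-length sequences
    x = torch.randn(sum(seq_lens), D)
    A = torch.rand(D, N) * 0.9              # stable decay per (channel, state)
    B, C = torch.randn(D, N), torch.randn(D, N)
    y = packed_ssm_scan(x, A, B, C, seq_lens)
    print(y.shape)                          # torch.Size([15, 4])
```

PackMamba itself targets Mamba's parallel (hardware-aware) scan operators rather than a Python loop; the reset-at-boundary behavior shown here is the property those modified operators preserve.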

Bibliographic Details
Main authors: Xu, Haoran; Liu, Ziqian; Fu, Rong; Su, Zhongling; Wang, Zerui; Cai, Zheng; Pei, Zhilin; Zhang, Xingcheng
Format: Article
Language: English
Subjects: Computer Science - Learning
Published: 2024-08-07
DOI: 10.48550/arxiv.2408.03865
Source: arXiv.org
Online access: https://arxiv.org/abs/2408.03865