An Efficient Sparse Inference Software Accelerator for Transformer-based Language Models on CPUs

In recent years, Transformer-based language models have become the standard approach for natural language processing tasks. However, stringent throughput and latency requirements in industrial applications are limiting their adoption. To mitigate the gap, model compression techniques such as structured pruning are being used to improve inference efficiency. However, most existing neural network inference runtimes lack adequate support for structured sparsity. In this paper, we propose an efficient sparse deep learning inference software stack for Transformer-based language models where the weights are pruned with a constant block size. Our sparse software accelerator leverages Intel Deep Learning Boost to maximize the performance of sparse matrix-dense matrix multiplication (commonly abbreviated as SpMM) on CPUs. Our SpMM kernel outperforms the existing sparse libraries (oneMKL, TVM, and LIBXSMM) by an order of magnitude on a wide range of GEMM shapes under 5 representative sparsity ratios (70%, 75%, 80%, 85%, 90%). Moreover, our SpMM kernel shows up to 5x speedup over the dense GEMM kernel of oneDNN, a well-optimized dense library widely used in industry. We apply our sparse accelerator to widely used Transformer-based language models including BERT-Mini, DistilBERT, BERT-Base, and BERT-Large. Our sparse inference software shows up to 1.5x speedup over Neural Magic's DeepSparse under the same configurations on Xeon instances on Amazon Web Services under proxy production latency constraints. We also compare our solution with two framework-based inference solutions, ONNX Runtime and PyTorch, and demonstrate up to 37x speedup over ONNX Runtime and 345x over PyTorch on Xeon under the latency constraints. All the source code is publicly available on GitHub: https://github.com/intel/intel-extension-for-transformers.
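As an illustration of the block-structured sparsity the abstract describes, the sketch below multiplies a weight matrix pruned in constant-size blocks by a dense input while skipping the pruned blocks entirely. It is a minimal NumPy reference, not the paper's kernel: the actual accelerator operates on quantized data and Intel Deep Learning Boost (AVX-512 VNNI) instructions, and the 4x1 block shape and the helper names prune_to_blocks / block_sparse_matmul here are assumptions chosen purely for illustration.

import numpy as np

def prune_to_blocks(weight, block=(4, 1), sparsity=0.9):
    # Zero out whole (block_rows x block_cols) blocks, keeping the blocks
    # with the largest L1 norm until the target sparsity ratio is reached.
    br, bc = block
    rows, cols = weight.shape
    assert rows % br == 0 and cols % bc == 0
    blocks = weight.reshape(rows // br, br, cols // bc, bc)
    scores = np.abs(blocks).sum(axis=(1, 3))       # one importance score per block
    k = int(scores.size * sparsity)                # number of blocks to prune
    cutoff = np.partition(scores.ravel(), k)[k]    # score threshold for survival
    mask = (scores >= cutoff)[:, None, :, None]    # broadcast mask back to elements
    return (blocks * mask).reshape(rows, cols)

def block_sparse_matmul(sparse_w, dense_x, block=(4, 1)):
    # Multiply a block-sparse weight by a dense input; pruned blocks cost nothing.
    br, bc = block
    out = np.zeros((sparse_w.shape[0], dense_x.shape[1]), dtype=sparse_w.dtype)
    for i in range(0, sparse_w.shape[0], br):
        for j in range(0, sparse_w.shape[1], bc):
            blk = sparse_w[i:i + br, j:j + bc]
            if blk.any():                          # skip blocks that were pruned away
                out[i:i + br] += blk @ dense_x[j:j + bc]
    return out

# Tiny check: at 90% block sparsity the result still matches the dense product.
w = prune_to_blocks(np.random.randn(64, 64).astype(np.float32), sparsity=0.9)
x = np.random.randn(64, 16).astype(np.float32)
assert np.allclose(block_sparse_matmul(w, x), w @ x, atol=1e-5)

A production kernel would store only the surviving blocks plus their indices and fuse quantization, but the control flow above is the essence of why constant-block pruning maps well onto wide SIMD units: each surviving block is a fixed-shape dense micro-kernel.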

Bibliographic Details
Authors: Shen, Haihao; Meng, Hengyu; Dong, Bo; Wang, Zhe; Zafrir, Ofir; Ding, Yi; Luo, Yu; Chang, Hanwen; Gao, Qun; Wang, Ziheng; Boudoukh, Guy; Wasserblat, Moshe
Format: Article
Language: English
Subjects: Computer Science - Artificial Intelligence; Computer Science - Computation and Language; Computer Science - Learning
DOI: 10.48550/arxiv.2306.16601
Source: arXiv.org
Online Access: https://arxiv.org/abs/2306.16601 (full text); https://github.com/intel/intel-extension-for-transformers (source code)