PF-GEMV: Utilization maximizing architecture in fast matrix-vector multiplication for GPT-2 inference

Owing to the widespread advancement of transformer-based artificial neural networks, artificial intelligence (AI) processors are now required to perform matrix-vector multiplication in addition to the conventional matrix-matrix multiplication. However, current AI processor architectures are optimized for general matrix-matrix multiplications (GEMMs), which causes significant throughput degradation when processing general matrix-vector multiplications (GEMVs). In this study, we propose a port-folding GEMV (PF-GEMV) scheme employing multiformat and low-precision techniques while reusing an outer-product-based processor optimized for conventional GEMM operations. This approach achieves 93.7% utilization in GEMV operations with an 8-bit format on an 8×8 processor, a 7.5× increase in throughput over the original scheme. Furthermore, when applied to the matrix operations of the GPT-2 large model, a 7× speedup is achieved in single-batch inference.

Bibliographic Details
Published in: ETRI Journal, 2024, Vol. 46(5), pp. 817-828
Main authors: Hyeji Kim, Yeongmin Lee, Chun-Gi Lyuh
Format: Article
Language: Korean
ISSN: 1225-6463
EISSN: 2233-7326
Online access: Full text
Sources: DOAJ Directory of Open Access Journals; Elektronische Zeitschriftenbibliothek (freely accessible e-journals); Wiley Free Content