A Practical Mixed Precision Algorithm for Post-Training Quantization

Neural network quantization is frequently used to optimize model size, latency, and power consumption for on-device deployment of neural networks. In many cases, a target bit-width is set for an entire network, meaning every layer gets quantized to the same number of bits. However, for many networks some layers are significantly more robust to quantization noise than others, leaving an important axis of improvement unused. As many hardware solutions provide multiple different bit-width settings, mixed-precision quantization has emerged as a promising solution to find a better performance-efficiency trade-off than homogeneous quantization. However, most existing mixed-precision algorithms are rather difficult for practitioners to use, as they require access to the training data, have many hyper-parameters to tune, or even depend on end-to-end retraining of the entire model. In this work, we present a simple post-training mixed-precision algorithm that only requires a small unlabeled calibration dataset to automatically select suitable bit-widths for each layer for desirable on-device performance. Our algorithm requires no hyper-parameter tuning, is robust to data variation, and takes into account practical hardware deployment constraints, making it a great candidate for practical use. We experimentally validate our proposed method on several computer vision tasks, natural language processing tasks, and many different networks, and show that we can find mixed-precision networks that provide a better trade-off between accuracy and efficiency than their homogeneous bit-width equivalents.
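
The abstract describes a procedure that picks a bit-width per layer using only a small unlabeled calibration set. To make the idea concrete, below is a minimal illustrative sketch, not the paper's actual algorithm: it quantizes each layer of a toy ReLU network with a symmetric uniform quantizer and keeps the lowest candidate bit-width whose per-layer signal-to-quantization-noise ratio (SQNR) on the calibration batch stays above a threshold. The quantizer, the SQNR criterion, the fixed threshold, and all helper names are assumptions made for illustration.

import numpy as np

def quantize(w, bits):
    # Symmetric uniform quantization of a weight tensor to the given bit-width
    # (illustrative; real deployments also quantize activations and use
    # hardware-specific quantizers).
    scale = np.max(np.abs(w)) / (2 ** (bits - 1) - 1)
    return np.round(w / scale) * scale

def sqnr_db(reference, test):
    # Signal-to-quantization-noise ratio in dB between full-precision and
    # quantized layer outputs.
    noise = np.mean((reference - test) ** 2)
    return 10.0 * np.log10(np.mean(reference ** 2) / max(noise, 1e-12))

def choose_bitwidths(weights, calib_x, candidates=(4, 8), min_sqnr=30.0):
    # Per layer: pick the lowest candidate bit-width whose output on the
    # unlabeled calibration batch keeps SQNR above the threshold; otherwise
    # fall back to the highest candidate. Each layer is judged independently,
    # with full-precision activations propagated between layers.
    per_layer_bits = []
    for w in weights:                      # weights: list of (in, out) matrices
        ref = calib_x @ w                  # full-precision layer output
        chosen = max(candidates)
        for bits in sorted(candidates):    # try the cheapest bit-width first
            if sqnr_db(ref, calib_x @ quantize(w, bits)) >= min_sqnr:
                chosen = bits
                break
        per_layer_bits.append(chosen)
        calib_x = np.maximum(ref, 0.0)     # ReLU, then feed the next layer
    return per_layer_bits

# Toy usage: a random 3-layer ReLU MLP and a small unlabeled calibration batch.
rng = np.random.default_rng(0)
layers = [rng.normal(size=(16, 16)) for _ in range(3)]
calib = rng.normal(size=(32, 16))
print(choose_bitwidths(layers, calib))     # per-layer bits; result is data-dependent

A real implementation would replace the fixed threshold with a search over the accuracy-efficiency trade-off under hardware deployment constraints, as the abstract indicates.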

Bibliographic Details
Published in: arXiv.org, 2023-02
Main authors: Nilesh Prasad Pandey; Nagel, Markus; Mart van Baalen; Huang, Yin; Patel, Chirag; Blankevoort, Tijmen
Format: Article
Language: English
Subjects: Algorithms; Computer vision; Hardware; Mathematical models; Measurement; Natural language processing; Network latency; Neural networks; Parameter robustness; Power consumption; Robustness; Tradeoffs; Training
EISSN: 2331-8422
Publisher: Cornell University Library, arXiv.org (Ithaca)
Source: Free E-Journals
Online access: Full text