Mixed Precision Post Training Quantization of Neural Networks with Sensitivity Guided Search
Saved in:
Main authors:
Format: Article
Language: English
Subjects:
Online access: Order full text
Abstract: Serving large-scale machine learning (ML) models efficiently and with low latency has become challenging owing to increasing model size and complexity. Quantizing models can simultaneously reduce memory and compute requirements, facilitating their widespread access. However, for large models not all layers are equally amenable to the same numerical precision, and aggressive quantization can lead to unacceptable loss in model accuracy. One approach to prevent this accuracy degradation is mixed-precision quantization, which allows different tensors to be quantized to varying levels of numerical precision, leveraging the capabilities of modern hardware. Such mixed-precision quantization can more effectively allocate numerical precision to different tensors "as needed" to preserve model accuracy while reducing footprint and compute latency. In this paper, we propose a method to efficiently determine quantization configurations of different tensors in ML models using post-training mixed-precision quantization. We analyze three sensitivity metrics and evaluate them for guiding the configuration search of two algorithms. We evaluate our method on computer vision and natural language processing tasks and demonstrate latency reductions of up to 27.59% and 34.31%, respectively, compared to the baseline 16-bit floating-point model while guaranteeing no more than 1% accuracy degradation.
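This record does not reproduce the paper's actual algorithms or sensitivity metrics, so the following is only a minimal Python sketch of the general idea the abstract describes: score each layer with a simple quantization-error sensitivity proxy, then greedily lower the precision of the least sensitive layers while staying within a 1% accuracy budget. The `evaluate` callback and all names here are assumptions introduced for illustration, not the paper's method.

```python
import numpy as np

def quantize(tensor, bits):
    """Uniform symmetric fake-quantization of a tensor to the given bit width."""
    qmax = 2 ** (bits - 1) - 1
    amax = float(np.max(np.abs(tensor)))
    scale = amax / qmax if amax > 0 else 1.0
    return np.clip(np.round(tensor / scale), -qmax, qmax) * scale

def sensitivity(weights, bits):
    """Sensitivity proxy: relative quantization error of a single layer's weights.

    The paper evaluates three sensitivity metrics; this stand-in is just one
    simple possibility (normalized squared error between original and
    quantized weights).
    """
    err = weights - quantize(weights, bits)
    return float(np.sum(err ** 2) / (np.sum(weights ** 2) + 1e-12))

def greedy_mixed_precision(layers, evaluate, baseline_acc,
                           max_drop=0.01, low_bits=8, high_bits=16):
    """Greedy search: quantize the least sensitive layers first, reverting any
    step that pushes measured accuracy more than `max_drop` below baseline.

    `layers` maps layer names to weight arrays; `evaluate` is a hypothetical
    callback that returns model accuracy under a given bit-width config.
    """
    order = sorted(layers, key=lambda name: sensitivity(layers[name], low_bits))
    config = {name: high_bits for name in layers}
    for name in order:
        config[name] = low_bits
        if baseline_acc - evaluate(config) > max_drop:
            config[name] = high_bits  # revert: accuracy loss exceeded the budget
    return config
```

In this sketch the sensitivity ranking only orders the search; the accuracy constraint is enforced by direct measurement, which matches the abstract's guarantee of at most 1% degradation while leaving the specific metrics and search algorithms to the paper itself.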
DOI: 10.48550/arxiv.2302.01382