NEURAL NETWORK INFERENCE QUANTIZATION

Bibliographic Details
Main authors: Quinn, Jerome L; Ward, Robert Todd; EL-KURDI, YOUSEF
Format: Patent
Language: eng
Description
Summary: One or more computer processors, responsive to neural network run-time, reduce one or more sets of maximum activations along a hidden dimension respectively associated with one or more activation tensors and one or more layers of a neural network. The one or more computer processors compute an interquartile range (IQR) clip threshold for each reduced set for each sequence dimension in the neural network. The one or more computer processors clip one or more activations based on respective computed IQR clip thresholds. The one or more computer processors quantize the clipped activations.
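
The four steps in the summary (reduce, compute an IQR threshold, clip, quantize) can be illustrated with a minimal sketch. This is one possible reading of the abstract, not the patented implementation. The function names, the Tukey multiplier k = 1.5, the symmetric signed 8-bit target, and the choice of a single threshold per activation tensor are all assumptions made for illustration. The activation tensor is modeled as a plain list of rows, one row per sequence position, each row spanning the hidden dimension.

```python
import statistics


def iqr_clip_threshold(values, k=1.5):
    """Upper fence Q3 + k * IQR over the reduced maxima.

    k = 1.5 is an assumed (conventional Tukey) multiplier; the abstract
    does not state one.
    """
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q3 + k * (q3 - q1)


def quantize_activations(acts, num_bits=8, k=1.5):
    """Hypothetical helper: clip by an IQR threshold, then quantize.

    acts: list of rows (sequence positions), each a list of floats
    (hidden dimension). Needs at least two rows.
    """
    # Step 1: reduce along the hidden dimension to one max |activation|
    # per sequence position.
    row_max = [max(abs(x) for x in row) for row in acts]

    # Step 2: IQR clip threshold over the reduced set, never larger than
    # the true maximum (assumed safeguard).
    t = min(iqr_clip_threshold(row_max, k), max(row_max))

    # Step 3 and 4: clip to [-t, t], then map to signed integers.
    qmax = 2 ** (num_bits - 1) - 1
    scale = t / qmax if t > 0 else 1.0
    quantized = [
        [round(max(-t, min(t, x)) / scale) for x in row] for row in acts
    ]
    return quantized, scale, t


if __name__ == "__main__":
    # One sequence position carries an outlier (50.0) that would otherwise
    # dominate the quantization range.
    acts = [[0.5, -1.0], [1.2, 0.3], [-0.9, 0.1], [1.1, -0.2], [50.0, 0.4]]
    q, scale, t = quantize_activations(acts)
    print("threshold:", t)
    print("quantized:", q)
```

In this sketch the outlier is clipped to the threshold before scaling, so the remaining activations keep most of the integer range instead of collapsing toward zero.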