FAST EIGHT-BIT FLOATING POINT (FP8) SIMULATION WITH LEARNABLE PARAMETERS


Bibliographic details
Main authors:	NAGEL, Markus; PETERS, Jorn Wilhelmus Timotheus; BLANKEVOORT, Tijmen Pieter Frederik; VAN BAALEN, Marinus Willem; KUZMIN, Andrey
Format:	Patent
Language:	English; French
Subjects:	CALCULATING; COMPUTING; COUNTING; ELECTRIC DIGITAL DATA PROCESSING; PHYSICS

Description
Summary:	A processor-implemented method for fast floating-point simulation with learnable parameters includes receiving a single-precision input. An integer quantization process is performed on the input. Each element of the input is scaled based on a scaling parameter to generate an m-bit floating-point output, where m is an integer.
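
The summary describes the method only at a high level, so the following is a minimal sketch of one way such a simulation could work, not the patented implementation. It assumes that each element receives its own power-of-two scale derived from its exponent, that integer rounding is applied on that scaled grid, and that the result is mapped back. The function name `simulate_fp`, the 3-bit mantissa, the 4-bit exponent, the bias of 7, and the choice to reserve no codes for infinity or NaN are all illustrative assumptions. The sketch also keeps the parameters fixed rather than learning them, and it uses Python floats (double precision) in place of a single-precision input.

```python
import math


def simulate_fp(x, mantissa_bits=3, exponent_bits=4, bias=7):
    """Round x to the nearest value of a small floating-point format.

    Hypothetical helper: the format parameters are assumptions made for
    illustration, not values taken from the catalog record.
    """
    if x == 0.0 or math.isnan(x):
        return x

    # Largest representable magnitude, assuming no exponent code is
    # reserved for infinity or NaN.
    max_exp = (2 ** exponent_bits - 1) - bias
    max_val = (2.0 - 2.0 ** -mantissa_bits) * 2.0 ** max_exp
    x = max(-max_val, min(max_val, x))

    # Per-element exponent: frexp returns abs(x) = m * 2**e with
    # 0.5 <= m < 1, so floor(log2(abs(x))) equals e - 1.
    _, e = math.frexp(abs(x))
    exponent = max(e - 1, 1 - bias)  # floor at the subnormal exponent

    # Per-element scale: spacing of representable values near x.
    scale = 2.0 ** (exponent - mantissa_bits)

    # Integer quantization on the scaled grid, then map back.
    return round(x / scale) * scale


if __name__ == "__main__":
    for value in (1.0, 1.07, -1.06, 0.001, 1000.0):
        print(value, "->", simulate_fp(value))
```

Under these assumptions, values between two representable numbers snap to the nearer one, and magnitudes above the largest representable value are clamped to it. In a trainable setting, quantities such as the bias or the clamping range would be the natural candidates for learnable parameters, but the record does not say which ones the patent learns.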