FloatX: A C ++ Library for Customized Floating-Point Arithmetic

We present FloatX (Float eXtended), a C ++ framework to investigate the effect of leveraging customized floating-point formats in numerical applications. FloatX formats are based on binary IEEE 754 with smaller significand and exponent bit counts specified by the user. Among other properties, FloatX...

Ausführliche Beschreibung

Gespeichert in:
Bibliographische Detailangaben
Veröffentlicht in:ACM transactions on mathematical software 2019-12, Vol.45 (4), p.1-23
Hauptverfasser: Flegar, Goran, Scheidegger, Florian, Novaković, Vedran, Mariani, Giovani, Tomás, Andrés E., Malossi, A. Cristiano I., Quintana-Ortí, Enrique S.
Format: Artikel
Sprache:eng
Online-Zugang:Volltext
Tags: Tag hinzufügen
Keine Tags, Fügen Sie den ersten Tag hinzu!
Beschreibung
Zusammenfassung:We present FloatX (Float eXtended), a C ++ framework to investigate the effect of leveraging customized floating-point formats in numerical applications. FloatX formats are based on binary IEEE 754 with smaller significand and exponent bit counts specified by the user. Among other properties, FloatX facilitates an incremental transformation of the code, relies on hardware-supported floating-point types as back-end to preserve efficiency, and incurs no storage overhead. The article discusses in detail the design principles, programming interface, and datatype casting rules behind FloatX. Furthermore, it demonstrates FloatX’s usage and benefits via several case studies from well-known numerical dense linear algebra libraries, such as BLAS and LAPACK; the Ginkgo library for sparse linear systems; and two neural network applications related with image processing and text recognition.
ISSN:0098-3500
1557-7295
DOI:10.1145/3368086