Jetfire: Efficient and Accurate Transformer Pretraining with INT8 Data Flow and Per-Block Quantization
Saved in:
Main authors:
Format: Article
Language: English
Keywords:
Online access: Order full text
Abstract: Pretraining transformers is generally time-consuming. Fully quantized
training (FQT) is a promising approach to speed up pretraining. However, most
FQT methods adopt a quantize-compute-dequantize procedure, which often leads to
suboptimal speedup and significant performance degradation when used in
transformers, due to high memory access overheads and low-precision
computations. In this work, we propose Jetfire, an efficient and accurate INT8
training method specific to transformers. Our method features an INT8 data flow
to optimize memory access and a per-block quantization method to maintain the
accuracy of pretrained transformers. Extensive experiments demonstrate that our
INT8 FQT method achieves accuracy comparable to the FP16 training baseline and
outperforms existing INT8 training methods for transformers. Moreover, for a
standard transformer block, our method offers an end-to-end training speedup of
1.42x and a 1.49x memory reduction compared to the FP16 baseline. Our code is
open sourced at https://github.com/thu-ml/Jetfire-INT8Training.
DOI: 10.48550/arxiv.2403.12422
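
To make the per-block quantization idea mentioned in the abstract concrete, below is a minimal illustrative sketch, not the paper's implementation: each fixed-size tile of a 2-D tensor is quantized to INT8 with its own scale, so an outlier only degrades the precision of its own block rather than the whole tensor. The 32x32 block size, the NumPy-based code, and the function names are assumptions for demonstration only.

```python
# Illustrative sketch of per-block INT8 quantization (not the authors' code).
# Block size, padding requirement, and symmetric scaling to +/-127 are assumptions.
import numpy as np

def quantize_per_block(x: np.ndarray, block: int = 32):
    """Quantize a 2-D float tensor to INT8 with one scale per (block x block) tile."""
    rows, cols = x.shape
    assert rows % block == 0 and cols % block == 0, "pad the tensor first"
    # Split into (row_blocks, col_blocks, block, block) tiles.
    tiles = x.reshape(rows // block, block, cols // block, block).transpose(0, 2, 1, 3)
    # One scale per tile, chosen so the tile's max magnitude maps to 127.
    scales = np.abs(tiles).max(axis=(2, 3), keepdims=True) / 127.0
    scales = np.maximum(scales, 1e-8)  # guard against all-zero tiles
    q = np.clip(np.rint(tiles / scales), -127, 127).astype(np.int8)
    return q, scales

def dequantize_per_block(q: np.ndarray, scales: np.ndarray, shape):
    """Undo quantize_per_block, returning a float32 tensor of the given shape."""
    rows, cols = shape
    return (q.astype(np.float32) * scales).transpose(0, 2, 1, 3).reshape(rows, cols)

if __name__ == "__main__":
    x = np.random.randn(128, 128).astype(np.float32)
    q, s = quantize_per_block(x)
    x_hat = dequantize_per_block(q, s, x.shape)
    print("max abs reconstruction error:", np.abs(x - x_hat).max())
```

Keeping one scale per block rather than one per tensor bounds the quantization error locally; the paper pairs this with an INT8 data flow so that operands stay in INT8 between operators instead of being repeatedly dequantized to higher precision.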