LCFI: A Fault Injection Tool for Studying Lossy Compression Error Propagation in HPC Programs
Main authors: | , , , , |
---|---|
Format: | Article |
Language: | eng |
Subjects: | |
Online access: | Order full text |
Summary: | Error-bounded lossy compression is becoming increasingly important to
today's extreme-scale HPC applications because of the ever-increasing volume of
data they generate. It has been widely used for in-situ visualization, data
stream intensity reduction, storage reduction, I/O performance improvement,
checkpoint/restart acceleration, memory footprint reduction, etc. Although many
works have optimized compression ratio, quality, and performance for different
error-bounded lossy compressors, no existing work attempts
to systematically understand the impact of lossy compression errors on HPC
applications due to error propagation.
In this paper, we propose and develop a lossy compression fault injection
tool, called LCFI. To the best of our knowledge, this is the first fault
injection tool that helps both lossy compressor developers and users to
systematically and comprehensively understand the impact of lossy compression
errors on HPC programs. The contributions of this work are threefold: (1) We
propose an efficient approach to inject lossy compression errors according to a
statistical analysis of compression errors for different state-of-the-art
compressors. (2) We build a fault injector that is highly applicable,
customizable, and easy to use in generating top-down comprehensive results, and
demonstrate the use of LCFI. (3) We evaluate LCFI on four representative HPC
benchmarks with different abstracted fault models and make several observations
about error propagation and its impact on program outputs. |
---|---|
DOI: | 10.48550/arxiv.2010.12746 |