Online Failure Prediction for Complex Systems: Methodology and Case Studies


Bibliographic Details
Published in:	IEEE Transactions on Dependable and Secure Computing 2023-07, Vol.20 (4), p.3520-3534
Main Authors:	Campos, Joao R.; Costa, Ernesto; Vieira, Marco
Format:	Article
Language:	English
Subjects:	availability; Benchmark testing; Case studies; Complex systems; Data models; Failure; machine learning; Online failure prediction; Perturbation; Perturbation methods; Prediction algorithms; Prediction models; Predictive models; reliability; Robustness; Task analysis

Description
Abstract:	Online Failure Prediction (OFP) allows proactively taking countermeasures before a failure occurs, such as saving data or restarting a system. However, despite its potential contribution to improving dependability, OFP still presents key limitations. Besides the problem of choosing the optimal set of features, assessing predictive models is complex and common procedures for supporting comparison are not available. There is, in fact, little work on developing and assessing failure predictors for complex systems. In this article, we present two extensive case studies on distinct Operating Systems (OSs), Linux and Windows, showing that it is possible to create models that can predict different types of incoming failures, highlighting various important considerations such as the operational requirements of the target system. To drive the case studies, we define a well-structured framework for a fair and sound assessment and comparison of alternative predictive solutions. It includes scenarios for choosing the most adequate metrics for the assessment, comparing alternative models, and selecting the best predictor, while considering the need to tolerate perturbations in the data. In practice, we show that, by following a well-defined process, it is possible to develop accurate failure predictors and establish a ranking of the models under evaluation in different scenarios and OSs.
ISSN:	1545-5971, 1941-0018
DOI:	10.1109/TDSC.2022.3192671