Rethinking Software Fault Tolerance
Published in: | IEEE Transactions on Reliability 2024-03, Vol.73 (1), p.67-72 |
---|---|
Main authors: | , , |
Format: | Article |
Language: | eng |
Subjects: | |
Summary: | Traditional software fault tolerance makes use of design-diversity-based redundancy. While proven to be effective, the independent development of multiple versions of a program or component is connected with high costs. This article shows that failures caused by so-called Mandelbugs (i.e., software faults whose activation and/or error propagation depends on the system environment) can often be treated by generating or forcing a new or modified execution environment. In the case of aging-related bugs, a subtype of Mandelbugs, failures can be postponed/prevented via a proactive technique known as software rejuvenation. Indeed, techniques based on environmental diversity, such as retry, reboot, or failover to an identical replica, are successfully used in practice. We discuss two such real-case examples, the IBM Session Initiation Protocol (SIP) Application Server cluster and Avaya gateway servers. |
---|---|
ISSN: | 0018-9529 1558-1721 |
DOI: | 10.1109/TR.2023.3330787 |