Rethinking Software Fault Tolerance

Bibliographic Details
Published in: IEEE Transactions on Reliability, 2024-03, Vol. 73 (1), pp. 67-72
Main authors: Trivedi, Kishor S.; Grottke, Michael; Lopez, Javier Alonso
Format: Article
Language: English
Description
Abstract: Traditional software fault tolerance makes use of design-diversity-based redundancy. While proven to be effective, the independent development of multiple versions of a program or component is connected with high costs. This article shows that failures caused by so-called Mandelbugs (i.e., software faults whose activation and/or error propagation depends on the system environment) can often be treated by generating or forcing a new or modified execution environment. In the case of aging-related bugs, a subtype of Mandelbugs, failures can be postponed/prevented via a proactive technique known as software rejuvenation. Indeed, techniques based on environmental diversity, such as retry, reboot, or failover to an identical replica, are successfully used in practice. We discuss two such real-case examples, the IBM Session Initiation Protocol (SIP) Application Server cluster and Avaya gateway servers.
ISSN: 0018-9529, 1558-1721
DOI: 10.1109/TR.2023.3330787
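
The environmental-diversity techniques named in the abstract (retry, failover to an identical replica, and proactive software rejuvenation) can be illustrated with a minimal Python sketch. It is not taken from the article: the names `TransientFailure`, `run_with_retry_and_failover`, and `RejuvenatingWorker` are hypothetical, and the fixed request-count trigger for rejuvenation is an assumption chosen for simplicity rather than a policy the authors describe.

```python
import time
from typing import Callable, Optional, Sequence, TypeVar

T = TypeVar("T")


class TransientFailure(Exception):
    """Hypothetical marker: a failure assumed to depend on the environment."""


def run_with_retry_and_failover(
    replicas: Sequence[Callable[[], T]],
    attempts_per_replica: int = 2,
    delay_seconds: float = 0.0,
) -> T:
    """Retry each identical replica, then fail over to the next one."""
    if not replicas or attempts_per_replica < 1:
        raise ValueError("need at least one replica and one attempt")
    last_error: Optional[TransientFailure] = None
    for replica in replicas:  # failover: same code, different environment
        for _ in range(attempts_per_replica):  # retry: same replica, later time
            try:
                return replica()
            except TransientFailure as exc:
                last_error = exc
                time.sleep(delay_seconds)
    assert last_error is not None
    raise last_error


class RejuvenatingWorker:
    """Resets its (hypothetical) aged state after a fixed number of requests."""

    def __init__(self, rejuvenate_every: int) -> None:
        if rejuvenate_every < 1:
            raise ValueError("rejuvenate_every must be at least 1")
        self.rejuvenate_every = rejuvenate_every
        self.handled_since_restart = 0
        self.restarts = 0

    def handle(self, request: str) -> str:
        if self.handled_since_restart >= self.rejuvenate_every:
            self._restart()  # proactive: act before aging causes a failure
        self.handled_since_restart += 1
        return f"ok:{request}"

    def _restart(self) -> None:
        # Stand-in for a real restart that would release leaked resources.
        self.handled_since_restart = 0
        self.restarts += 1
```

Note that every replica runs the same code, so the sketch relies only on a changed execution environment (a later point in time, a different process or host, or a freshly restarted state) and not on independently developed versions.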