Distributed process redundancy

Bibliographic details
Main authors: Kidder, Joseph D; Langrind, Nicholas A; Sullivan, Jr., Daniel J; Fox, Barbara A; Whitesel, Richard L
Format: Patent
Language: English
Description
Summary: The majority of Internet outages are directly attributable to software upgrade issues and software quality in general. Mitigation of network downtime is a constant battle for service providers. In pursuit of "five 9's availability," or 99.999% network uptime, service providers must minimize network outages due to equipment (i.e., hardware) failures and all-too-common software failures. Service providers incur downtime not only from failures but also from upgrades that deploy new or improved software or hardware, or the fixes and patches needed to deal with current network problems. A network outage can also occur after an upgrade has been installed, if the upgrade itself includes undetected problems (i.e., bugs) or causes other software or hardware to have problems. Data merging, data conversion and untested compatibilities contribute to downtime. Upgrades often result in data loss due to incompatible data file formats. Downtime may occur unexpectedly days after an upgrade due to lurking software or hardware incompatibilities. Often, the upgrade of one process results in the failure of another process; this is often referred to as regression. Sometimes one change can cause several other components to fail; this is often called the "ripple" effect. To avoid compatibility problems, multiple versions (upgraded and not upgraded) of the same software are not executed at the same time.

A distributed software redundancy design is disclosed to minimize network outages and other problems associated with component or process failures by spreading software backup (in the so-called "hot state") across multiple elements. The distributed redundancy architecture of the present invention also permits the location of the hardware backup element to float: if a primary element fails, its functions can be transferred to the backup element, and when the failed primary element is replaced, the replacement hardware can serve as the hardware backup. If one or more of the primary processes on a particular element experiences a software fault, the processor on the line card may terminate and restart the failing process or processes. Once the process or processes are restarted, a copy of the last known dynamic state (i.e., the backup state) can be retrieved from the corresponding backup processes executing on a second line card, and an audit process can be initiated to synchronize the retrieved state with the dynamic state of associ
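
The restart-and-restore sequence described in the summary can be illustrated with a minimal Python sketch. It is not the patented implementation: the class names `BackupProcess` and `PrimaryProcess`, the `audit` function, and the `live_view` dictionary are all hypothetical, and because the source text is cut off before it says what the audit compares against, the sketch simply assumes the live view is authoritative.

```python
from dataclasses import dataclass, field


@dataclass
class BackupProcess:
    """Hypothetical hot backup running on a second line card."""
    state: dict = field(default_factory=dict)

    def checkpoint(self, key, value):
        # Mirror each dynamic-state change made by the primary.
        self.state[key] = value

    def snapshot(self):
        # Return a copy so the caller cannot mutate the backup state.
        return dict(self.state)


@dataclass
class PrimaryProcess:
    """Hypothetical primary process whose state is mirrored to a backup."""
    backup: BackupProcess
    state: dict = field(default_factory=dict)

    def update(self, key, value):
        self.state[key] = value
        self.backup.checkpoint(key, value)

    def restart(self):
        # Terminate and restart: in-memory state is lost ...
        self.state = {}
        # ... then the last known dynamic state is retrieved from the backup.
        self.state = self.backup.snapshot()


def audit(restored, live_view):
    """Reconcile restored state with a live view (assumed authoritative)."""
    reconciled = dict(restored)
    reconciled.update(live_view)
    return reconciled


backup = BackupProcess()
primary = PrimaryProcess(backup)
primary.update("conn-1", "up")
primary.update("conn-2", "up")
primary.restart()
primary.state = audit(primary.state, {"conn-2": "down"})
```

In this sketch the backup lives in the same Python process for simplicity; in the architecture the summary describes, it would execute on a separate line card so that it survives a fault on the primary.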