Fault-Tolerant Mechanism of the Distributed Cluster Computers

Bibliographic Details
Published in:	Tsinghua science and technology 2007-07, Vol.12 (S1), p.186-191
Main authors:	Shang, Yizi; Jin, Yang; Wu, Baosheng
Format:	Article
Language:	English
Subjects:	active list; distributed system; file invocation; multilevel fault-tolerance
Online access:	Full text

Description
Abstract:	Distributed systems with high performance and stability are commonly adopted in large-scale scientific and engineering computing. In this paper, we discuss a fault-tolerant mechanism for Linux environments that improves the system's fault tolerance, namely a scheme and framework for building a stable computing platform. Based on the structure and function of the distributed system, active-list and file-invocation strategies are employed in task management. Multilevel fault tolerance is achieved by repeating processes on a single node and by migrating tasks across multiple nodes. The manager node agent introduced in this paper administers the nodes using the list and assigns tasks according to the nodes' performance, thereby making full use of the cluster resources. An evaluation method is proposed to appraise the performance. The analysis shows that the proposed scheme is useful, at the cost of some additional memory overhead.
ISSN:	1007-0214, 1878-7606
DOI:	10.1016/S1007-0214(07)70107-4
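
The multilevel fault tolerance summarized in the abstract (repeating a task on a single node, then migrating it to another node chosen from an active list by performance) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the names `Node`, `ManagerAgent`, `retries_per_node`, and the numeric `performance` score are all hypothetical, and a task failure is assumed to surface as a `RuntimeError`.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Node:
    name: str
    performance: float  # hypothetical score; higher means a better node


@dataclass
class ManagerAgent:
    # Hypothetical manager: keeps an active list of usable nodes.
    active: List[Node] = field(default_factory=list)
    retries_per_node: int = 2  # assumed number of attempts on a single node

    def run(self, task: Callable[[Node], str]) -> str:
        # Visit active nodes from best to worst performance.
        # sorted() returns a copy, so removing from self.active below is safe.
        for node in sorted(self.active, key=lambda n: n.performance, reverse=True):
            # Level 1: repeat the task on the same node.
            for _ in range(self.retries_per_node):
                try:
                    return task(node)
                except RuntimeError:
                    continue
            # Level 2: the node keeps failing, so drop it from the active
            # list and migrate the task to the next node.
            self.active.remove(node)
        raise RuntimeError("no active node could complete the task")
```

Calling `ManagerAgent(active=[Node("slow", 1.0), Node("fast", 2.0)]).run(task)` tries `fast` first; if `task` raises `RuntimeError` on it twice, `fast` is removed from the active list and the task is retried on `slow`.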