MULTS: A multi-cloud fault-tolerant architecture to manage transient servers in cloud computing

Detailed description

Bibliographic details
Published in:	Journal of systems architecture 2019-12, Vol.101, p.101651, Article 101651
Main authors:	Araujo Neto, Jose Pergentino; Pianto, Donald M.; Ralha, Célia Ghedini
Format:	Article
Language:	English
Subjects:	Checkpoint; Cloud computing; Fault tolerance; Machine learning; Resilient architecture; Spot instance; Survival analysis

Description
Summary:	Highlights:
•An architecture that provides an efficient way to use transient servers in the cloud.
•Use of a scenario-optimal checkpoint to guarantee execution and reduce user costs.
•Experiments used 21 million price changes collected from Amazon AWS spot instances.
•Experiments created a knowledge database with approximately 110 million records.
•Prediction accuracy reached 92%, demonstrating the potential of the approach.
The large-scale use of cloud computing resources has made the reliability of cloud environments an important issue. In addition, cloud providers now trade unreliable virtual machines: they exploit their unused resources by offering them as transient servers, a lower-priced virtual machine service whose resources can be revoked without user intervention. To increase the availability of transient servers, we propose a multi-cloud fault-tolerant architecture that provides a resilient environment, using a scenario-based optimal checkpoint scheme to guarantee running processes at reduced user cost. The architecture combines a heuristic that extracts information from case-based reasoning with a statistical model that predicts failure events, which helps to refine the fault tolerance parameters. The result is a cloud environment with better reliability and reduced execution time. Extensive simulations show high accuracy, with a survival prediction success rate of up to 92% and an execution time reduction of 74.58% for long-running applications. The results are promising, indicating that the proposed architecture can prevent revocation failures under realistic working conditions.
ISSN:	1383-7621 1873-6165
DOI:	10.1016/j.sysarc.2019.101651
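
Illustrative sketch (not from the article): the summary mentions a statistical model that predicts failure events, and the subject terms list survival analysis, but the record does not say which model the authors use. As a rough illustration of the general idea only, the Python below implements a plain Kaplan-Meier estimator over observed instance lifetimes. The function names `kaplan_meier` and `survival_at`, the sample data, and the choice of estimator are all assumptions made for this example.

```python
from typing import List, Tuple


def kaplan_meier(durations: List[float], revoked: List[bool]) -> List[Tuple[float, float]]:
    """Kaplan-Meier survival curve (hypothetical helper, not the authors' model).

    durations[i] is how long instance i was observed (e.g. in hours).
    revoked[i] is True if the observation ended with a revocation and
    False if it was censored (the instance was still alive when we stopped watching).
    Returns (time, survival probability) pairs at each observed revocation time.
    """
    if len(durations) != len(revoked):
        raise ValueError("durations and revoked must have the same length")

    # Distinct times at which at least one revocation was observed.
    event_times = sorted({d for d, e in zip(durations, revoked) if e})

    survival = 1.0
    curve: List[Tuple[float, float]] = []
    for t in event_times:
        # Instances still under observation just before time t.
        at_risk = sum(1 for d in durations if d >= t)
        # Revocations that happened exactly at time t.
        events = sum(1 for d, e in zip(durations, revoked) if e and d == t)
        survival *= 1.0 - events / at_risk
        curve.append((t, survival))
    return curve


def survival_at(curve: List[Tuple[float, float]], t: float) -> float:
    """Estimated probability that an instance survives beyond time t."""
    prob = 1.0
    for time, s in curve:
        if time <= t:
            prob = s
        else:
            break
    return prob


# Made-up sample: five instances, two of them censored.
sample_durations = [2.0, 3.0, 3.0, 5.0, 8.0]
sample_revoked = [True, True, False, True, False]
sample_curve = kaplan_meier(sample_durations, sample_revoked)
```

A scheduler could, for example, compare `survival_at(sample_curve, planned_runtime)` against a threshold to decide how often to checkpoint; that policy is again only an assumption for illustration, not the scheme described in the article.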