Efficient Replication for Fast and Predictable Performance in Distributed Computing

Bibliographic Details
Published in:	IEEE/ACM Transactions on Networking, 2021-08, Vol. 29 (4), pp. 1467-1476
Main authors:	Behrouzi-Far, Amir; Soljanin, Emina
Format:	Article
Language:	English
Subjects:	coefficient of variations; Computational modeling; Computer architecture; Computer networks; distributed computing; Distributed processing; distributed systems; Internet; latency; Machine learning; Optimization; Redundancy; Replication; Task analysis; Training

Description
Summary:	Master-worker distributed computing systems use task replication to mitigate the effect of slow workers on job compute time. The master node groups tasks into batches and assigns each batch to one or more workers. We first assume that the batches do not overlap. Using majorization theory, we show that a balanced replication of batches minimizes the average job compute time for a general class of service time distributions. We then show that the balanced assignment of non-overlapping batches achieves a lower average job compute time than the overlapping schemes proposed in the literature. Next, we derive the optimum redundancy level as a function of the task service time distribution. We show that the redundancy level that minimizes the average job compute time may not coincide with the redundancy level that maximizes job compute time predictability. Therefore, there is a trade-off in optimizing the two metrics. By running experiments on Google cluster traces, we observe that redundancy can reduce the job compute time by one order of magnitude. The optimum level of redundancy depends on the distribution of task service time.
ISSN:	1063-6692 1558-2566
DOI:	10.1109/TNET.2021.3062215
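
Illustration (not part of the catalog record or the article): the summary's claim that balanced replication of non-overlapping batches minimizes the average job compute time can be checked on a toy case. The sketch below assumes, purely for illustration, that each worker finishes its batch after an independent exponential time with rate `mu`, so a batch replicated on `n` workers finishes after an Exp(`n * mu`) time and the job finishes when the slowest batch does. The function name `expected_job_time` and the example assignments are hypothetical; the article itself treats a more general class of service time distributions.

```python
from itertools import combinations


def expected_job_time(replicas, mu=1.0):
    """Expected job compute time under the toy exponential model.

    replicas[i] is the number of workers assigned to batch i (assumed model:
    each worker's batch time is an independent Exp(mu) variable).
    The job time is the maximum over batches of the minimum over replicas.
    By inclusion-exclusion, E[max] is the alternating sum over nonempty
    subsets S of E[min over S], and the minimum of independent exponentials
    is exponential with the summed rate.
    """
    total = 0.0
    for k in range(1, len(replicas) + 1):
        sign = 1.0 if k % 2 == 1 else -1.0
        for subset in combinations(replicas, k):
            total += sign / (mu * sum(subset))
    return total


# Four workers, two non-overlapping batches.
balanced = expected_job_time([2, 2])    # 0.75
unbalanced = expected_job_time([1, 3])  # about 1.083

print(f"balanced:   {balanced:.3f}")
print(f"unbalanced: {unbalanced:.3f}")
```

In this toy case the balanced assignment (two workers per batch) gives a lower expected job time than the unbalanced one, in line with the direction of the result stated in the summary; it says nothing about other distributions or about predictability.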