Straggler Mitigation With Tiered Gradient Codes

Bibliographic Details
Published in: IEEE Transactions on Communications, 2020-08, Vol. 68 (8), p. 4632-4647
Main authors: Sasi, Shanuja; Lalitha, V.; Aggarwal, Vaneet; Rajan, B. Sundar
Format: Article
Language: English
Description
Summary: Coding-theoretic techniques have been proposed for synchronous Gradient Descent (GD) on multiple servers to mitigate stragglers. These techniques provide the flexibility that the job is complete when any k out of n servers finish their assigned tasks. The task size on each server is determined by the values of k and n. However, it is assumed that all n tasks are started when the job is requested. In contrast, we assume a tiered system, where we start with n_1 ≥ k tasks and, on completion of c tasks, start n_2 − n_1 more tasks. The aim is that the job gets completed as long as k servers can execute their tasks. This paper exploits the flexibility that not all servers are started at the request time to obtain the achievable task sizes on each server. The task sizes are in general lower than those obtained when all n_2 tasks are started at the request time, which helps to reduce both the job completion time and the total server utilization.
ISSN: 0090-6778, 1558-0857
DOI: 10.1109/TCOMM.2020.2992721
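
Illustration: the tiered launch rule described in the summary (start n_1 tasks, start n_2 − n_1 more once c tasks have completed, finish when k tasks are done) can be sketched as a small simulation. This is only a rough sketch and not the authors' method. The function name `tiered_completion_time`, the exponential task durations, and the unit task size are assumptions made for illustration, and the paper's task-size results are not reproduced here.

```python
import heapq
import random


def tiered_completion_time(k, n1, n2, c, rng, rate=1.0):
    """Time until k tasks finish under a two-tier launch rule.

    Hypothetical model: n1 tasks start at time 0, and n2 - n1 more start
    at the moment the c-th task completes. Task durations are assumed to
    be i.i.d. exponential with the given rate (an illustrative choice).
    """
    if not (1 <= k <= n1 <= n2) or c < 0:
        raise ValueError("need 1 <= k <= n1 <= n2 and c >= 0")

    # Finish times of the first tier, all launched at time 0.
    running = [rng.expovariate(rate) for _ in range(n1)]
    second_tier_pending = n2 > n1

    # c == 0 means the second tier also starts at request time.
    if second_tier_pending and c == 0:
        running += [rng.expovariate(rate) for _ in range(n2 - n1)]
        second_tier_pending = False

    heapq.heapify(running)
    done, now = 0, 0.0
    while done < k:
        # The next event is always the earliest finish among running tasks.
        now = heapq.heappop(running)
        done += 1
        if second_tier_pending and done == c:
            # Launch the second tier at the time of the c-th completion.
            for _ in range(n2 - n1):
                heapq.heappush(running, now + rng.expovariate(rate))
            second_tier_pending = False
    return now


if __name__ == "__main__":
    rng = random.Random(0)
    runs = [tiered_completion_time(k=3, n1=4, n2=6, c=1, rng=rng) for _ in range(10_000)]
    print("mean completion time:", sum(runs) / len(runs))
```

Under these assumptions, setting c = 0 corresponds to the untiered case in which all n_2 tasks start at request time. Because the sketch keeps the task size fixed, it only shows the launch timing and does not capture the reduction in task size that the summary attributes to the tiered scheme.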