ELASTICALLY MANAGING WORKERS OF MULTI-WORKER WORKLOADS ON ACCELERATOR DEVICES

The disclosure herein describes elastically managing the execution of workers of multi-worker workloads on accelerator devices. A first worker of a workload is executed on an accelerator device during a first time interval. A first context switch point is identified when the first worker is in a fir...

Ausführliche Beschreibung

Gespeichert in:
Bibliographische Detailangaben
Hauptverfasser: GULAVANI, Bhargav, SIVATHANU, Muthian, VISWANATHA, Srinidhi, WELANKAR, Kaustubh, SHUKLA, Dharma Kiritkumar, NEHME, Rimma Vladimirovna, RAMJEE, Ramachandran, AGRAWAL, Amey, ANUPINDI, Ravi Shreyas
Format: Patent
Sprache:eng
Schlagworte:
Online-Zugang:Volltext bestellen
Tags: Tag hinzufügen
Keine Tags, Fügen Sie den ersten Tag hinzu!
Beschreibung
Zusammenfassung:The disclosure herein describes elastically managing the execution of workers of multi-worker workloads on accelerator devices. A first worker of a workload is executed on an accelerator device during a first time interval. A first context switch point is identified when the first worker is in a first worker state. At the identified context switch point, a first memory state of the first worker is stored in a host memory and the accelerator device is configured to a second memory state of the second worker. The second worker is executed during a second time interval and a second context switch point is identified at the end of the second time interval when the second worker is in a state that is equivalent to the first worker state. During the intervals, collective communication operations between the workers are accumulated and, at the second context switch point, the accumulated operations are performed.