Datacenter-Scale Analysis and Optimization of GPU Machine Learning Workloads

Bibliographic Details
Published in:	IEEE Micro, 2021-09, Vol. 41 (5), p. 101-112
Main authors:	Wesolowski, Lukasz; Acun, Bilge; Andrei, Valentin; Aziz, Adnan; Dankel, Gisle; Gregg, Christopher; Meng, Xiaoqiao; Meurillon, Cyril; Sheahan, Denis; Tian, Lei; Yang, Janet; Yu, Peifeng; Hazelwood, Kim
Format:	Article
Language:	English
Subjects:	Data centers; Efficiency; Graphics processing units; Machine learning; Measurement; Optimization; Social networking (online); Telemetry; Training data; Very large scale; Workflow; Workload; Workloads

Description
Summary:	In this article, we present a system to collectively optimize efficiency in a very large scale deployment of GPU servers for machine learning workloads at Facebook. Our system 1) measures and stores system-wide efficiency metrics for every executed workflow; 2) aggregates data from across the execution stack to identify optimization opportunities that maximize fleet-wide efficiency improvements; 3) provides periodic and on-demand whole-system profiling for workflows; and 4) automatically analyzes traces for common antipatterns. We present each component of the stack and show case studies demonstrating the use of the tools to significantly improve performance. To our knowledge, our system is the most complete and effective solution for identifying and addressing efficiency problems in datacenter-scale GPU deployments.
ISSN:	0272-1732, 1937-4143
DOI:	10.1109/MM.2021.3097287