MAO: Machine learning approach for NUMA optimization in Warehouse Scale Computers
Saved in:
Main authors: | , , , , |
---|---|
Format: | Article |
Language: | eng |
Subjects: | |
Online access: | Order full text |
Summary: | Non-Uniform Memory Access (NUMA) architecture poses numerous
performance challenges for today's cloud workloads. Due to the complexity
and massive scale of modern warehouse-scale computers (WSCs), considerable
effort is required to improve memory access locality on the NUMA
architecture. At Baidu, we have found that NUMA optimization yields
significant performance benefits for major workloads such as Search and
Feed (Baidu's recommendation system). However, conducting NUMA optimization
within a large-scale cluster introduces many subtle complexities and
workload-specific optimization scenarios. In this paper, we present MAO
(Memory Access Optimizer), a solution deployed in Baidu's production
environment that improves memory access locality for Baidu's various
workloads. MAO includes an online module and an offline module. The online
module is responsible for online monitoring, dynamic NUMA node binding,
and runtime optimization, while the offline workload characterization
module performs data analysis and resource-sensitivity model training. We
also propose a new performance model, the "NUMA Sensitivity model", to
capture the impact of remote memory access on workload performance and to
project the potential performance improvement from NUMA optimization for a
specific workload. Based on data collected continuously from online
monitoring, this model has proved to work well in MAO. As of today, we
have successfully deployed MAO to more than one hundred thousand servers.
In our Feed product, we have achieved a 12.1% average latency improvement
and 9.8% CPU resource savings. |
---|---|
DOI: | 10.48550/arxiv.2411.01460 |
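The abstract credits MAO's online module with dynamic NUMA node binding but does not describe the mechanism. As a minimal, standard-library-only sketch of what such binding can look like on Linux, the code below reads the kernel's sysfs topology for one NUMA node and pins the current process to that node's CPUs. The helper names and the choice of node 0 are illustrative assumptions, not MAO's actual implementation.

```python
# Minimal sketch of dynamic NUMA node binding on Linux (illustrative only;
# MAO's real binding logic is not described in the abstract above).
import os


def cpus_of_node(node: int) -> set[int]:
    """Parse the kernel's cpulist for a NUMA node, e.g. '0-7,16-23'."""
    path = f"/sys/devices/system/node/node{node}/cpulist"
    with open(path) as f:
        spec = f.read().strip()
    cpus: set[int] = set()
    for part in spec.split(","):
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return cpus


def bind_to_node(pid: int, node: int) -> None:
    """Restrict a process to the CPUs of one NUMA node.

    Note: this pins CPU placement only; confining memory allocation to the
    same node would additionally require set_mempolicy(2) / libnuma, which
    the Python standard library does not expose.
    """
    os.sched_setaffinity(pid, cpus_of_node(node))


if __name__ == "__main__":
    bind_to_node(0, 0)  # pid 0 = the calling process; bind it to node 0
    print("now bound to CPUs:", sorted(os.sched_getaffinity(0)))
```

A production system like the one the abstract describes would presumably re-evaluate such bindings continuously from monitoring data rather than fixing them once at startup.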
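The "NUMA Sensitivity model" itself is likewise not specified in the abstract, only its purpose: relate remote memory access to workload performance and project the gain from NUMA optimization. One plain way to realize that purpose, and a loud assumption here, is to fit observed latency as a linear function of the remote-access ratio reported by online monitoring, then project the latency if all accesses were local. The sample numbers and the linear form below are made up purely for illustration.

```python
# Hypothetical sketch in the spirit of a "NUMA sensitivity" projection (the
# paper's actual model is not described in the abstract). Assumption: for a
# given workload, average latency grows roughly linearly with the fraction
# of memory accesses served by a remote NUMA node.
from statistics import linear_regression

# (remote_access_ratio, avg_latency_ms) samples from online monitoring;
# these values are fabricated for the example.
samples = [(0.10, 20.5), (0.25, 22.9), (0.40, 25.2), (0.55, 27.8)]

ratios = [r for r, _ in samples]
latencies = [lat for _, lat in samples]

# latency ~= slope * remote_ratio + intercept: the slope is the workload's
# NUMA sensitivity, the intercept its projected all-local latency.
slope, intercept = linear_regression(ratios, latencies)

latest = latencies[-1]
projected_gain = 1.0 - intercept / latest
print(f"NUMA sensitivity (ms per unit remote ratio): {slope:.2f}")
print(f"projected latency improvement if fully local: {projected_gain:.1%}")
```

Such a per-workload fit would also give the resource-sensitivity training mentioned in the abstract something concrete to learn: workloads with a steep slope are the ones worth binding first.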