Hercules: Heterogeneity-Aware Inference Serving for At-Scale Personalized Recommendation
Main authors: | , , , , , |
---|---|
Format: | Article |
Language: | eng |
Subjects: | |
Online access: | Order full text |
Summary: | Personalized recommendation is an important class of deep-learning
applications that powers a large collection of internet services and consumes a
considerable amount of datacenter resources. As the scale of production-grade
recommendation systems continues to grow, optimizing their serving performance
and efficiency in a heterogeneous datacenter is important and can translate
into infrastructure capacity savings. In this paper, we propose Hercules, an
optimized framework for personalized recommendation inference serving that
targets diverse industry-representative models and cloud-scale heterogeneous
systems. Hercules performs a two-stage optimization procedure: offline
profiling and online serving. The first stage searches the large under-explored
task scheduling space with a gradient-based search algorithm, achieving up to
9.0x latency-bounded throughput improvement on individual servers; it also
identifies the optimal heterogeneous server architecture for each
recommendation workload. The second stage performs heterogeneity-aware cluster
provisioning to optimize resource mapping and allocation in response to
fluctuating diurnal loads. The proposed cluster scheduler in Hercules achieves
47.7% cluster capacity savings and reduces the provisioned power by 23.7% over a
state-of-the-art greedy scheduler. |
---|---|
DOI: | 10.48550/arxiv.2203.07424 |
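
The summary contrasts heterogeneity-aware cluster provisioning with a greedy scheduler but does not describe either algorithm. The following toy sketch only illustrates why the choice of server type affects provisioned power in a heterogeneous fleet. It is not the scheduler from the paper: `ServerType`, `provision`, `total_power`, and all throughput and power figures are hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServerType:
    name: str
    qps: float       # latency-bounded throughput per server (hypothetical value)
    watts: float     # provisioned power per server (hypothetical value)
    available: int   # number of servers of this type in the fleet


def provision(load_qps, server_types, key):
    """Allocate servers in descending order of `key` until the load is covered.

    Returns a dict mapping server-type name to the number of servers used.
    Raises ValueError if the fleet cannot serve the requested load.
    """
    plan = {}
    remaining = load_qps
    for st in sorted(server_types, key=key, reverse=True):
        if remaining <= 0:
            break
        needed = -(-remaining // st.qps)  # ceiling division
        count = int(min(needed, st.available))
        if count > 0:
            plan[st.name] = count
            remaining -= count * st.qps
    if remaining > 0:
        raise ValueError("insufficient capacity for the requested load")
    return plan


def total_power(plan, server_types):
    """Sum the provisioned power (watts) of every server in the plan."""
    watts = {st.name: st.watts for st in server_types}
    return sum(watts[name] * n for name, n in plan.items())


# Hypothetical two-type fleet: many modest servers, a few fast power-hungry ones.
fleet = [
    ServerType("cpu", qps=100.0, watts=200.0, available=50),
    ServerType("gpu", qps=500.0, watts=1500.0, available=10),
]

# Ordering 1: prefer the highest raw throughput per server.
by_throughput = provision(1000, fleet, key=lambda s: s.qps)

# Ordering 2: prefer the highest throughput per watt.
by_efficiency = provision(1000, fleet, key=lambda s: s.qps / s.watts)
```

With these made-up numbers, both plans serve the same load, but ordering by throughput per watt provisions less power than ordering by raw throughput. A production scheduler would also have to account for per-model latency targets, diurnal load changes, and limited availability of each server type, none of which this sketch models.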