Optimum: Runtime optimization for multiple mixed model deployment deep learning inference

Bibliographic Details
Published in: Journal of Systems Architecture, 2023-08, Vol. 141, p. 102901, Article 102901
Authors: Guo, Kaicheng; Xu, Yixiao; Qi, Zhengwei; Guan, Haibing
Format: Article
Language: English
Online access: Full text
Description
Abstract: GPUs used in data centers to perform deep learning inference tasks are underutilized. Previous systems tended to deploy a single model per GPU to ensure that inference tasks met throughput and latency requirements. The rapid growth of a single GPU's resources, along with the emergence of scenarios involving small models and small batches, has exacerbated the problem of low GPU utilization. In such cases, a mixed model deployment can significantly improve GPU utilization while also giving the upper layer of the inference system greater flexibility. How to select model combinations and optimization strategies for mixed model deployment, however, remains an open problem. This paper proposes Optimum, the first model-combination planning and runtime optimization framework for mixed model deployment. To cope with the enormous search space, Optimum uses performance prediction to select model combinations with low search overhead. The predictor is a multilayer perceptron whose input features are the profiling results of the model engines and whose output is the predicted performance degradation. Runtime optimization strategies allow Optimum to optimize performance and make fine-grained tradeoffs. The Optimum prototype is built on CUDA multi-stream and TensorRT. Test results show a consistent 10.3% performance improvement over mainstream single-model deployments, and up to 7.09% improvement over the state of the art with an order-of-magnitude reduction in search overhead.
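
The abstract describes the predictor only at a high level: a multilayer perceptron that maps profiling features of a model engine to a predicted performance degradation. The sketch below illustrates that idea in PyTorch; the feature names, layer sizes, and dimensions are illustrative assumptions, not the paper's actual design.

```python
# Illustrative sketch of an MLP-based performance predictor, as described
# in the abstract: inputs are profiling features of model engines, the
# output is a predicted performance degradation. Feature choices and layer
# sizes here are assumptions, not the paper's actual configuration.
import torch
import torch.nn as nn

class DegradationPredictor(nn.Module):
    def __init__(self, num_features: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(num_features, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),  # predicted performance degradation
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

# Hypothetical profiling features for one candidate model combination,
# e.g. per-engine latency, occupancy, and memory footprint.
features = torch.tensor([[1.8, 0.42, 0.31, 2.6, 0.55, 0.24]])
model = DegradationPredictor(num_features=features.shape[1])
predicted_degradation = model(features)
```

A cheap predictor like this is what lets the planner score many candidate model combinations without actually co-deploying and benchmarking each one, which is where the claimed order-of-magnitude reduction in search overhead would come from.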
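The prototype itself is built on CUDA multi-stream and TensorRT. A minimal sketch of the multi-stream idea follows, with two PyTorch models standing in for the paper's TensorRT engines so the stream mechanics stay visible; it is not the Optimum implementation.

```python
# Minimal sketch of mixed model deployment on one GPU via CUDA streams.
# Two PyTorch models stand in for the paper's TensorRT engines.
import torch

model_a = torch.nn.Linear(512, 512).cuda().eval()
model_b = torch.nn.Conv2d(3, 16, 3).cuda().eval()

stream_a = torch.cuda.Stream()
stream_b = torch.cuda.Stream()

x_a = torch.randn(8, 512, device="cuda")
x_b = torch.randn(8, 3, 224, 224, device="cuda")

# Make the side streams wait for the default stream that produced the inputs.
stream_a.wait_stream(torch.cuda.current_stream())
stream_b.wait_stream(torch.cuda.current_stream())

with torch.no_grad():
    # Launch each model's inference on its own stream so the two
    # engines can share the GPU concurrently.
    with torch.cuda.stream(stream_a):
        out_a = model_a(x_a)
    with torch.cuda.stream(stream_b):
        out_b = model_b(x_b)

# Wait for both streams before consuming the results.
torch.cuda.synchronize()
```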
ISSN: 1383-7621, 1873-6165
DOI: 10.1016/j.sysarc.2023.102901