Automated Backend Allocation for Multi-Model, On-Device AI Inference



Bibliographic Details
Published in: Performance Evaluation Review 2024-06, Vol. 52 (1), p. 27-28
Main Authors: Iyer, Venkatraman; Lee, Sungho; Lee, Semun; Kim, Juitem Joonwoo; Kim, Hyunjun; Shin, Youngjae
Format: Article
Language: English
Online Access: Full text
Description
Abstract: On-Device Artificial Intelligence (AI) services such as face recognition, object tracking, and voice recognition are rapidly scaling up deployments on embedded, memory-constrained hardware devices. These services typically delegate AI inference models for execution on CPU and GPU computing backends. While GPU delegation is a common practice to achieve high-speed computation, the approach suffers from degraded throughput and completion times under multi-model scenarios, i.e., concurrently executing services. This paper introduces a solution to sustain performance in multi-model, on-device AI contexts by dynamically allocating a combination of CPU and GPU backends per model. The allocation is feedback-driven and guided by knowledge of model-specific, multi-objective Pareto fronts comprising inference latency and memory consumption. Our backend allocation algorithm runs online per model and achieves a 25-100% improvement in throughput over static allocations as well as load-balancing scheduler solutions targeting multi-model scenarios.
ISSN: 0163-5999
eISSN: 1557-9484
DOI: 10.1145/3673660.3655046
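
The abstract describes a feedback-driven allocator that picks a CPU/GPU backend combination per model from a Pareto front of inference latency and memory consumption. As a rough illustration of that general idea only, here is a minimal Python sketch; the `BackendConfig` type, the profiled numbers, and the greedy selection rule are hypothetical assumptions, not the authors' published algorithm.

```python
# Hypothetical sketch of feedback-driven backend allocation guided by a
# per-model Pareto front of (latency, memory) points. All names, numbers,
# and the selection heuristic are illustrative, not the paper's code.
from dataclasses import dataclass
from typing import List

@dataclass
class BackendConfig:
    name: str          # e.g. "CPU", "GPU", "CPU+GPU"
    latency_ms: float  # profiled inference latency
    memory_mb: float   # profiled memory footprint

def pareto_front(configs: List[BackendConfig]) -> List[BackendConfig]:
    """Keep configs not strictly dominated in both latency and memory."""
    return [c for c in configs
            if not any(o.latency_ms <= c.latency_ms and o.memory_mb <= c.memory_mb
                       and (o.latency_ms < c.latency_ms or o.memory_mb < c.memory_mb)
                       for o in configs)]

def allocate(configs: List[BackendConfig], free_memory_mb: float) -> BackendConfig:
    """Feedback step: among Pareto-optimal configs that fit in the currently
    available memory, pick the lowest-latency one."""
    feasible = [c for c in pareto_front(configs) if c.memory_mb <= free_memory_mb]
    if not feasible:
        # Under memory pressure, fall back to the smallest-footprint config.
        return min(configs, key=lambda c: c.memory_mb)
    return min(feasible, key=lambda c: c.latency_ms)

# Example: re-run allocate() for a model whenever measured memory pressure
# changes, e.g. when another model's service starts or stops.
profiles = [
    BackendConfig("GPU", latency_ms=8.0, memory_mb=420.0),
    BackendConfig("CPU", latency_ms=35.0, memory_mb=90.0),
    BackendConfig("CPU+GPU", latency_ms=14.0, memory_mb=210.0),
]
print(allocate(profiles, free_memory_mb=250.0).name)  # -> "CPU+GPU"
```

Re-invoking the allocator as free memory fluctuates captures the feedback-driven aspect: the same model may move between backends as concurrent services come and go.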