Low-Latency ML Inference by Grouping Correlated Data Objects and Computation
Format: Article
Language: English
Abstract: ML inference workflows often require low latency and high throughput, yet we lack good options for addressing this need. Techniques that reduce latency in other streaming settings (such as caching and optimization-driven scheduling) are of limited value because ML data dependencies are often very large and can change dramatically depending on the triggering event. In this work, we propose a novel correlation grouping mechanism that makes it easier for developers to express application-specific data access correlations, enabling coordinated management of data objects in server clusters hosting streaming inference tasks. Experiments based on a latency-sensitive ML-based application confirm the limitations of standard techniques while showing that our solution yields dramatically better performance. The proposed mechanism maintains significantly lower and more consistent latency and achieves higher node utilization as workload and scale-out increase, yet requires only minor changes to the application code.
DOI: 10.48550/arxiv.2312.11488
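
The abstract describes the mechanism only at a high level: developers declare which data objects tend to be accessed together, and the cluster then manages (and places) each group as a unit so that a triggering event finds its large, correlated dependency set on one node. The Python sketch below is a minimal illustration of that idea, not the paper's actual API; all names (`Placement`, `declare_group`, `route`, the node list) are hypothetical.

```python
# Hypothetical sketch of developer-declared correlation groups: the placement
# layer hashes whole groups, rather than individual objects, onto nodes, so
# every object in a group is co-located. Not the paper's API; names are
# illustrative assumptions.

import hashlib

class Placement:
    """Maps correlation groups, rather than individual objects, to nodes."""

    def __init__(self, nodes):
        self.nodes = nodes
        self.group_of = {}       # object key -> group id
        self.node_of_group = {}  # group id -> node

    def _hash_to_node(self, s):
        h = int(hashlib.sha256(s.encode()).hexdigest(), 16)
        return self.nodes[h % len(self.nodes)]

    def declare_group(self, group_id, keys):
        """Developer hint: these objects tend to be accessed together."""
        for k in keys:
            self.group_of[k] = group_id
        # Hash the group id (not each key), so the whole group lands together.
        self.node_of_group[group_id] = self._hash_to_node(group_id)

    def route(self, key):
        """Route a request to the node holding the key's entire group."""
        gid = self.group_of.get(key)
        if gid is None:
            # Ungrouped objects fall back to ordinary per-key hashing.
            return self._hash_to_node(key)
        return self.node_of_group[gid]

# Usage: an event touching any object in user 42's group hits a single node,
# so all of that event's correlated dependencies are served locally.
placement = Placement(nodes=["node-a", "node-b", "node-c"])
placement.declare_group(
    "user-42",
    ["user-42/embedding", "user-42/history", "user-42/model-shard"],
)
assert placement.route("user-42/history") == placement.route("user-42/embedding")
```

Under this reading, per-key caching or hashing would scatter an event's dependencies across nodes, which is consistent with the abstract's claim that standard latency-reduction techniques fall short when dependencies are large and event-dependent.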