Server load and network-aware adaptive deep learning inference offloading for edge platforms

Full description

Saved in:
Bibliographic Details
Published in: Internet of things (Amsterdam. Online) 2023-04, Vol.21, p.100644, Article 100644
Main authors: Ahn, Jungmo, Lee, Youngki, Ahn, Jeongseob, Ko, JeongGil
Format: Article
Language: English
Subjects:
Online access: Full text
Description
Summary: This work presents DIAMOND, a deep neural network computation offloading scheme that combines a lightweight client-to-server latency profiling component with a server inference time estimation module to accurately assess the expected latency of a deep learning model inference. Latency predictions for both the network and the server are used together to make dynamic (partial) model offloading decisions at the client at run-time. Compared to previous work, DIAMOND aims to minimize network latency estimation overhead and accounts for the concurrent processing nature of state-of-the-art deep learning inference server designs. Our extensive evaluations with an NVIDIA Jetson Nano client connected to an NVIDIA Triton server show that DIAMOND completes inference operations with noticeably reduced computational/energy overhead and latency compared to previously proposed model offloading approaches. Furthermore, our results show that DIAMOND adapts well to practical server load and network dynamics.
• Design of an adaptive deep learning inference offloading scheme for mobile-edge computing.
• Efficient resource profiling and latency-aware computation/model offloading decision making.
• Inference request processing latency profiler that accounts for high-throughput deep learning servers.
• TCP operation-aware network profiler for accurately estimating offloading data communication latency.
• Deployment and validation of the proposed system on state-of-the-art deep learning serving frameworks and mobile/embedded GPU platforms.
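The core idea, choosing a partial-offloading split point by comparing predicted client compute, network transfer, and server-side latency, can be sketched roughly as follows. This is a minimal illustrative sketch under assumed inputs (a per-layer client-time/output-size profile, a simple RTT-plus-serialization network model, and scalar server speedup and queueing estimates); function names and the latency model are hypothetical, not DIAMOND's actual implementation:

```python
# Hypothetical sketch of a latency-prediction-driven split-point decision.
# All names, parameters, and the latency model are illustrative assumptions.

def estimate_network_latency(bytes_out, bandwidth_bps, rtt_s):
    """Rough transfer-time estimate: one RTT plus serialization delay."""
    return rtt_s + (bytes_out * 8) / bandwidth_bps

def choose_split(layers, bandwidth_bps, rtt_s, server_speedup,
                 server_queue_s, input_bytes):
    """Pick the layer index at which to hand computation off to the server.

    layers: list of (client_compute_s, output_bytes) per layer.
    Split k runs layers [0, k) on the client and [k, n) on the server;
    k == n means fully local, k == 0 means fully offloaded.
    server_queue_s models queueing delay under concurrent server load.
    """
    n = len(layers)
    best_k = n
    best_latency = sum(t for t, _ in layers)  # fully local baseline
    for k in range(n):
        client_s = sum(t for t, _ in layers[:k])
        # Bytes to ship: raw model input for k == 0, else layer k-1's output.
        out_bytes = input_bytes if k == 0 else layers[k - 1][1]
        net_s = estimate_network_latency(out_bytes, bandwidth_bps, rtt_s)
        server_s = server_queue_s + sum(t for t, _ in layers[k:]) / server_speedup
        total = client_s + net_s + server_s
        if total < best_latency:
            best_k, best_latency = k, total
    return best_k, best_latency
```

With a fast network and an unloaded server the estimate favors full offloading (k = 0), while a slow link or a heavily queued server pushes the decision toward fully local execution (k = n); intermediate splits win when an early layer sharply shrinks the tensor that must cross the network.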
ISSN: 2542-6605
DOI: 10.1016/j.iot.2022.100644