PREBA: A Hardware/Software Co-Design for Multi-Instance GPU based AI Inference Servers
Format: Article
Language: English
Online access: Order full text
Abstract: NVIDIA's Multi-Instance GPU (MIG) is a feature that enables system designers
to reconfigure one large GPU into multiple smaller GPU slices. This work
characterizes this emerging GPU and evaluates its effectiveness in designing
high-performance AI inference servers. Our study reveals that the data
preprocessing stage of AI inference creates significant performance bottlenecks
for MIG. To address this, we present PREBA, a hardware/software co-design
targeting MIG inference servers. Our first proposition is an FPGA-based data
preprocessing accelerator that unlocks the full potential of MIG with
domain-specific acceleration of data preprocessing. The MIG inference server,
freed from preprocessing overheads, is then augmented with our dynamic
batching system that enables high-performance inference. PREBA is implemented
end-to-end in real systems, providing a 3.7x improvement in throughput, 3.4x
reduction in tail latency, 3.5x improvement in energy-efficiency, and 3.0x
improvement in cost-efficiency.
DOI: 10.48550/arxiv.2411.19114
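
For readers unfamiliar with MIG, the following minimal Python sketch shows one common way an inference server process can be pinned to a single MIG slice: expose exactly one MIG UUID through CUDA_VISIBLE_DEVICES so that the slice appears as an ordinary CUDA device. The UUIDs, the PyTorch stand-in model, and the per-slice worker layout are illustrative assumptions, not details taken from the paper.

```python
import os
import multiprocessing as mp

# Hypothetical MIG instance UUIDs, as listed by `nvidia-smi -L` on a
# MIG-enabled GPU (placeholder values, not real devices).
MIG_UUIDS = [
    "MIG-11111111-2222-3333-4444-555555555555",
    "MIG-66666666-7777-8888-9999-000000000000",
]

def inference_worker(mig_uuid: str) -> None:
    # Exposing a single MIG slice: CUDA then enumerates it as device 0,
    # so unmodified inference code runs independently on each slice.
    os.environ["CUDA_VISIBLE_DEVICES"] = mig_uuid
    import torch  # imported after setting the env var so CUDA honors it
    device = torch.device("cuda:0")
    model = torch.nn.Linear(1024, 1000).to(device).eval()  # stand-in model
    with torch.no_grad():
        x = torch.randn(8, 1024, device=device)
        _ = model(x)

if __name__ == "__main__":
    # One worker process per MIG slice, mirroring a multi-instance server.
    mp.set_start_method("spawn")
    procs = [mp.Process(target=inference_worker, args=(u,)) for u in MIG_UUIDS]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```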
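
The abstract also mentions a dynamic batching system; its actual policy is not described in the record, so the sketch below only illustrates the generic idea: queue requests individually and flush them as a single batch once either a size threshold or a waiting-time threshold is reached. The thresholds, the Request type, and the run_model stub are assumptions for illustration.

```python
import queue
import threading
import time
from dataclasses import dataclass, field

# Illustrative thresholds, not PREBA's actual batching policy.
MAX_BATCH = 16     # flush once this many requests are collected
MAX_WAIT_S = 0.005 # or once this much time passes after the first request

@dataclass
class Request:
    payload: list                      # preprocessed input (placeholder)
    done: threading.Event = field(default_factory=threading.Event)
    result: object = None

request_queue: "queue.Queue[Request]" = queue.Queue()

def run_model(batch):
    # Stand-in for a single batched forward pass on one GPU/MIG slice.
    return [len(r.payload) for r in batch]

def batching_loop():
    while True:
        first = request_queue.get()            # block until work arrives
        batch = [first]
        deadline = time.monotonic() + MAX_WAIT_S
        while len(batch) < MAX_BATCH:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                break
            try:
                batch.append(request_queue.get(timeout=remaining))
            except queue.Empty:
                break
        outputs = run_model(batch)             # one batched inference call
        for req, out in zip(batch, outputs):
            req.result = out
            req.done.set()                     # wake the waiting client

threading.Thread(target=batching_loop, daemon=True).start()

# Client side: submit one request and wait for its batched result.
req = Request(payload=[0.0] * 8)
request_queue.put(req)
req.done.wait()
print(req.result)
```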