HeteGen: Heterogeneous Parallel Inference for Large Language Models on Resource-Constrained Devices
Format: Article
Language: English
Abstract: In recent times, the emergence of Large Language Models (LLMs) has resulted in increasingly large model sizes, posing challenges for inference on low-resource devices. Prior approaches have explored offloading to enable low-memory inference but often suffer from low efficiency due to I/O bottlenecks. To achieve low-latency LLM inference on resource-constrained devices, we introduce HeteGen, a novel approach that presents a principled framework for heterogeneous parallel computing using CPUs and GPUs. Building on this framework, HeteGen further employs heterogeneous parallel computing and asynchronous overlap for LLMs to mitigate I/O bottlenecks. Our experiments demonstrate a substantial improvement in inference speed, surpassing state-of-the-art methods by up to 317%.
DOI: 10.48550/arxiv.2403.01164
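The abstract's central idea, overlapping CPU computation with the I/O-bound GPU path so that weight transfers are hidden behind useful work, can be illustrated with a minimal sketch. This is not HeteGen's implementation: the per-layer column split, the function names, and the use of a background thread standing in for the transfer-plus-GPU path are all illustrative assumptions.

```python
# Minimal sketch of heterogeneous CPU/GPU overlap for offloaded inference.
# NOT HeteGen's actual implementation: the layer split, function names, and
# the thread standing in for "transfer + GPU compute" are assumptions made
# purely to illustrate asynchronous overlap.
import threading
import numpy as np

def cpu_part(x, w_cpu):
    # Portion of the layer computed on the CPU while the offloaded weights
    # for the other portion are (conceptually) in flight to the GPU.
    return x @ w_cpu

def transfer_and_gpu_part(x, w_gpu, out):
    # Stand-in for the I/O-bound path: copying offloaded weights to the GPU
    # and computing there. Here it is simply another matmul run in a thread;
    # NumPy's BLAS calls release the GIL, so the two paths genuinely overlap.
    out.append(x @ w_gpu)

def layer_forward(x, w_cpu, w_gpu):
    gpu_out = []
    t = threading.Thread(target=transfer_and_gpu_part, args=(x, w_gpu, gpu_out))
    t.start()                    # launch the "transfer + GPU compute" path
    y_cpu = cpu_part(x, w_cpu)   # CPU works concurrently, hiding the I/O cost
    t.join()
    return np.concatenate([y_cpu, gpu_out[0]], axis=-1)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.standard_normal((4, 512))
    w_cpu = rng.standard_normal((512, 128))  # columns assigned to the CPU
    w_gpu = rng.standard_normal((512, 384))  # columns assigned to the GPU path
    print(layer_forward(x, w_cpu, w_gpu).shape)  # (4, 512)
```

In a real offloading system, the ratio of work assigned to each device would be tuned so that CPU compute time roughly matches the transfer-plus-GPU time, which is the condition under which the I/O bottleneck is fully hidden.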