FreshGNN: Reducing Memory Access via Stable Historical Embeddings for Graph Neural Network Training
Saved in:
Main authors: , , , , , , , , ,
Format: Article
Language: eng
Subjects:
Online access: Order full text
Abstract: A key performance bottleneck when training graph neural network (GNN)
models on large, real-world graphs is loading node features onto a GPU. Because
GPU memory is limited, these features must be stored on devices with slower
access (e.g. CPU memory), which necessitates expensive data movement. Moreover,
the irregularity of graph structures leads to poor data locality, which further
exacerbates the problem. Consequently, existing frameworks that can train large
GNN models efficiently usually incur significant accuracy degradation because
of the shortcuts they take. To address these limitations, we instead propose
FreshGNN, a general-purpose GNN mini-batch training framework that maintains a
historical cache for storing and reusing GNN node embeddings instead of
recomputing them from raw features at every iteration. Critical to its success,
the cache policy uses a combination of gradient-based and staleness criteria to
separate embeddings that are relatively stable, and can therefore be cached,
from those that must be recomputed to limit estimation errors and the resulting
downstream accuracy loss. Paired with complementary system enhancements that
support this selective historical cache, FreshGNN accelerates training on large
graph datasets such as ogbn-papers100M and MAG240M by 3.4x to 20.5x and reduces
memory access by 59%, with less than a 1% impact on test accuracy.
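To make the selective caching idea concrete, here is a minimal PyTorch sketch of a historical embedding cache combining a staleness criterion with a gradient-based stability test. All names and thresholds (`HistoricalEmbeddingCache`, `max_staleness`, `grad_threshold`) are illustrative assumptions, not the paper's actual API or implementation.

```python
import torch

class HistoricalEmbeddingCache:
    """Illustrative historical cache for intermediate GNN node embeddings.

    Embeddings whose gradients stay small are treated as stable and reused
    for up to `max_staleness` iterations; all other nodes are recomputed
    from raw features. Names and thresholds are assumptions, not the
    paper's API.
    """

    def __init__(self, num_nodes, dim, max_staleness=10, grad_threshold=1e-3):
        self.emb = torch.zeros(num_nodes, dim)   # cached node embeddings
        self.age = torch.full((num_nodes,), -1)  # iterations since refresh; -1 = not cached
        self.max_staleness = max_staleness       # staleness criterion
        self.grad_threshold = grad_threshold     # gradient-based criterion

    def lookup(self, node_ids):
        """Split a mini-batch into cache hits (reuse) and misses (recompute)."""
        age = self.age[node_ids]
        hit = (age >= 0) & (age < self.max_staleness)
        return node_ids[hit], node_ids[~hit]

    def update(self, node_ids, new_emb, grad_norms):
        """After recomputation, cache only embeddings whose per-node
        gradient norm indicates they are stable across iterations."""
        stable = grad_norms < self.grad_threshold
        self.emb[node_ids[stable]] = new_emb[stable].detach()
        self.age[node_ids[stable]] = 0            # freshly cached
        self.age[node_ids[~stable]] = -1          # evict unstable nodes

    def step(self):
        """Advance staleness counters once per training iteration."""
        self.age[self.age >= 0] += 1
```

In a training loop, `lookup` would route cache misses through the usual raw-feature fetching and aggregation path while hits skip it entirely, which is where the claimed reduction in memory traffic would come from.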
DOI: 10.48550/arXiv.2301.07482