Stochastic Communication Avoidance for Recommendation Systems
Format: Article
Language: English
Abstract: Presented at the IEEE Conference on Artificial Intelligence (IEEE CAI) 2024. One of the major bottlenecks for efficient deployment of neural-network-based recommendation systems is the memory footprint of their embedding tables. Although many of these models could benefit from the faster on-chip memory access and increased computational power of hardware accelerators, their large embedding tables often cannot fit in the constrained memory of accelerators. Despite the pervasiveness of these models, prior methods in memory optimization and parallelism fail to address the memory and communication costs of large embedding tables on accelerators. As a result, the majority of models are trained on CPUs, while current accelerator implementations are hindered by bottlenecks such as inter-device communication and main-memory lookups. In this paper, we propose a theoretical framework that analyzes the communication costs of arbitrary distributed systems that use lookup tables. We use this framework to propose algorithms that maximize throughput subject to memory, computation, and communication constraints. Furthermore, we demonstrate that our method achieves strong theoretical performance across dataset distributions and memory constraints, and is applicable to a wide range of use cases, from mobile federated learning to warehouse-scale computation. We implement our framework and algorithms in PyTorch and achieve up to 6x increases in training throughput on GPU systems over baselines on the Criteo Terabytes dataset.
DOI: 10.48550/arxiv.2411.01611
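
The abstract describes algorithms that maximize training throughput when embedding tables exceed accelerator memory, trading off on-device placement against inter-device communication. The sketch below is not the paper's method; it is a minimal toy model, assuming a simplified setting in which each table either resides fully in accelerator memory or stays in host memory and pays per-batch lookup traffic. All names (`EmbeddingTable`, `place_tables`), sizes, and the greedy cost-per-byte heuristic are illustrative assumptions, not details taken from the paper.

```python
# Illustrative toy model (not the paper's algorithm): choose which embedding
# tables to keep in accelerator memory so that expected per-batch host<->device
# traffic is minimized under a fixed memory budget.
from dataclasses import dataclass

BYTES_PER_FLOAT = 4


@dataclass
class EmbeddingTable:
    name: str
    num_rows: int             # vocabulary size of the categorical feature
    dim: int                  # embedding dimension
    lookups_per_batch: float  # expected number of row lookups per batch

    @property
    def size_bytes(self) -> int:
        return self.num_rows * self.dim * BYTES_PER_FLOAT

    @property
    def traffic_per_batch(self) -> float:
        # Bytes that cross the interconnect per batch if the table stays in
        # host memory and rows are fetched on demand.
        return self.lookups_per_batch * self.dim * BYTES_PER_FLOAT


def place_tables(tables, device_budget_bytes):
    """Greedy knapsack heuristic: keep on-device the tables that save the most
    communication per byte of accelerator memory they occupy."""
    ranked = sorted(tables,
                    key=lambda t: t.traffic_per_batch / t.size_bytes,
                    reverse=True)
    on_device, on_host, used = [], [], 0
    for t in ranked:
        if used + t.size_bytes <= device_budget_bytes:
            on_device.append(t)
            used += t.size_bytes
        else:
            on_host.append(t)
    return on_device, on_host


if __name__ == "__main__":
    # Hypothetical table sizes and access rates, for illustration only.
    tables = [
        EmbeddingTable("user_id", num_rows=10_000_000, dim=64, lookups_per_batch=4096),
        EmbeddingTable("item_id", num_rows=2_000_000, dim=64, lookups_per_batch=4096),
        EmbeddingTable("geo", num_rows=50_000, dim=16, lookups_per_batch=4096),
    ]
    on_device, on_host = place_tables(tables, device_budget_bytes=1 * 2**30)  # 1 GiB budget
    print("on device:", [t.name for t in on_device])
    print("on host  :", [t.name for t in on_host])
    residual = sum(t.traffic_per_batch for t in on_host)
    print(f"residual per-batch traffic: {residual / 2**20:.2f} MiB")
```

A real system would additionally need to handle sharding across multiple devices, caching of frequently accessed rows, and the distribution of lookups across the dataset, which is where the memory, computation, and communication constraints discussed in the abstract come into play.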