MoE-CAP: Cost-Accuracy-Performance Benchmarking for Mixture-of-Experts Systems
Format: Article
Language: English
Online access: Order full text
Summary: The sparse Mixture-of-Experts (MoE) architecture is increasingly favored for
scaling Large Language Models (LLMs) efficiently; however, MoE systems rely on
heterogeneous compute and memory resources. These factors collectively
influence the system's Cost, Accuracy, and Performance (CAP), creating a
challenging trade-off. Current benchmarks often fail to provide precise
estimates of these effects, complicating practical considerations for deploying
MoE systems. To bridge this gap, we introduce MoE-CAP, a benchmark specifically
designed to evaluate MoE systems. Our findings highlight the difficulty of
achieving an optimal balance of cost, accuracy, and performance with existing
hardware capabilities. MoE systems often necessitate compromises on one factor
to optimize the other two, a dynamic we term the MoE-CAP trade-off. To identify
the best trade-off, we propose novel performance evaluation metrics - Sparse
Memory Bandwidth Utilization (S-MBU) and Sparse Model FLOPS Utilization (S-MFU)
- and develop cost models that account for the heterogeneous compute and memory
hardware integral to MoE systems. This benchmark is publicly available on
HuggingFace:
https://huggingface.co/spaces/sparse-generative-ai/open-moe-llm-leaderboard
DOI: 10.48550/arxiv.2412.07067
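The record does not reproduce the paper's formal definitions of S-MBU and S-MFU. As a rough, non-authoritative sketch only: the snippet below assumes the sparse metrics follow the common MBU/MFU pattern (achieved throughput divided by hardware peak) while counting only the parameters and FLOPs of the experts actually activated per token, which is the distinction the summary describes. The function names, the 2-FLOPs-per-activated-parameter approximation, and all hardware and model numbers are illustrative assumptions, not values taken from the paper.

```python
# Illustrative sketch of sparsity-aware utilization metrics in the spirit of
# S-MFU / S-MBU. Assumption: same form as standard MFU/MBU, restricted to the
# experts activated per token. Not the paper's exact formulation.

def s_mfu(activated_params: float, tokens_per_sec: float, peak_flops: float) -> float:
    """Sparse Model FLOPS Utilization: achieved FLOPs per second, counting
    ~2 FLOPs per *activated* parameter per generated token, over peak FLOPs."""
    achieved_flops = 2.0 * activated_params * tokens_per_sec
    return achieved_flops / peak_flops

def s_mbu(activated_weight_bytes: float, kv_cache_bytes: float,
          tokens_per_sec: float, peak_bandwidth: float) -> float:
    """Sparse Memory Bandwidth Utilization: bytes read per decoded token
    (activated expert weights + KV cache) times decode throughput,
    over peak memory bandwidth."""
    achieved_bandwidth = (activated_weight_bytes + kv_cache_bytes) * tokens_per_sec
    return achieved_bandwidth / peak_bandwidth

if __name__ == "__main__":
    # Hypothetical single-stream decode example: an MoE activating ~13B of its
    # parameters per token, BF16 weights (2 bytes/param), ~4 GB of KV cache,
    # 50 tok/s on a GPU with 989 TFLOPS peak BF16 and 3.35 TB/s bandwidth.
    print(f"S-MFU: {s_mfu(13e9, 50, 989e12):.2%}")
    print(f"S-MBU: {s_mbu(13e9 * 2, 4e9, 50, 3.35e12):.2%}")
```

Under these assumptions, a dense-model MFU/MBU computed over all parameters would overstate the work an MoE system actually performs per token, which is the motivation the summary gives for the sparse variants.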