HoME: Hierarchy of Multi-Gate Experts for Multi-Task Learning at Kuaishou
| Main authors: | , , , , |
|---|---|
| Format: | Article |
| Language: | English |
| Subjects: | |
| Online access: | Order full text |
Abstract: In this paper, we present the practical problems and lessons learned from Kuaishou's short-video services. In industry, a widely used multi-task framework is the Mixture-of-Experts (MoE) paradigm, which typically introduces shared experts and task-specific experts and then uses gate networks to weight the related experts' contributions. Although MoE achieves remarkable improvements, we still observe three anomalies that seriously affect model performance during our iterations: (1) Expert Collapse: We found that the experts' output distributions differ significantly, and some experts have over 90% zero activations under ReLU, making it hard for the gate networks to assign fair weights that balance the experts. (2) Expert Degradation: Ideally, a shared expert should provide predictive information for all tasks simultaneously. Nevertheless, we find that some shared experts are occupied by only one task, indicating that they have lost their sharing ability and degenerated into task-specific experts. (3) Expert Underfitting: Our services need to predict dozens of behavior tasks, but we find that some data-sparse prediction tasks tend to ignore their specific experts and assign large weights to the shared experts. The reason might be that the shared experts receive more gradient updates and knowledge from dense tasks, while the specific experts easily fall into underfitting due to their tasks' sparse behaviors. Motivated by these observations, we propose HoME to achieve a simple, efficient, and balanced MoE system for multi-task learning.
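The shared/specific-expert pattern with per-task gates that the abstract refers to can be sketched as follows. This is a minimal illustration of that general multi-gate MoE structure, not the HoME architecture itself; the module names, dimensions, and expert counts are assumptions chosen for the example, and the final zero-activation check merely illustrates the kind of diagnostic behind the "Expert Collapse" observation.

```python
# Minimal sketch (PyTorch) of a multi-gate MoE layer with shared and
# task-specific experts, as described in the abstract. NOT the HoME model;
# names, sizes, and expert counts are illustrative assumptions.
import torch
import torch.nn as nn


class MultiGateMoELayer(nn.Module):
    def __init__(self, input_dim, expert_dim, num_tasks,
                 num_shared_experts=2, num_specific_experts=2):
        super().__init__()
        make_expert = lambda: nn.Sequential(
            nn.Linear(input_dim, expert_dim), nn.ReLU())
        # Shared experts: intended to serve every task.
        self.shared_experts = nn.ModuleList(
            [make_expert() for _ in range(num_shared_experts)])
        # Task-specific experts: one group per task.
        self.specific_experts = nn.ModuleList([
            nn.ModuleList([make_expert() for _ in range(num_specific_experts)])
            for _ in range(num_tasks)])
        # One gate per task, scoring its shared + specific experts.
        self.gates = nn.ModuleList([
            nn.Linear(input_dim, num_shared_experts + num_specific_experts)
            for _ in range(num_tasks)])

    def forward(self, x):
        shared_out = [e(x) for e in self.shared_experts]
        task_outputs = []
        for t, gate in enumerate(self.gates):
            specific_out = [e(x) for e in self.specific_experts[t]]
            experts = torch.stack(shared_out + specific_out, dim=1)  # [B, E, D]
            weights = torch.softmax(gate(x), dim=-1).unsqueeze(-1)   # [B, E, 1]
            task_outputs.append((weights * experts).sum(dim=1))      # [B, D]
        return task_outputs  # one representation per task, fed to task towers


x = torch.randn(8, 64)                  # toy batch of 8 feature vectors
layer = MultiGateMoELayer(64, 32, num_tasks=3)
outs = layer(x)                         # list of 3 tensors, each [8, 32]

# Illustrative "Expert Collapse" diagnostic: fraction of zero ReLU outputs
# per shared expert (the abstract reports some experts exceeding 90%).
zero_rates = [(e(x) == 0).float().mean().item() for e in layer.shared_experts]
print(zero_rates)
```

In this sketch each task's gate only mixes the shared experts with that task's own specific experts; the anomalies described above concern how such gates end up weighting the two groups in practice.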
DOI: 10.48550/arxiv.2408.05430