DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model

Bibliographic Details
Main Authors: DeepSeek-AI, Liu, Aixin, Feng, Bei, Wang, Bin, Wang, Bingxuan, Liu, Bo, Zhao, Chenggang, Dengr, Chengqi, Dai, Damai, Guo, Daya, Ji, Dongjie, Li, Erhang, Lin, Fangyun, Luo, Fuli, Hao, Guangbo, Chen, Guanting, Xu, Hanwei, Gao, Huazuo, Qu, Hui, Cai, J. L, Liang, Jian, Ni, Jiaqi, Li, Jiashi, Qiu, Junjie, Dong, Kai, Gao, Kaige, Zhang, Lecong, Xu, Lei, Xia, Leyi, Zhang, Liyue, Li, Meng, Wang, Miaojun, Zhang, Mingchuan, Li, Mingming, Tian, Ning, Wang, Peiyi, Zhu, Qihao, Du, Qiushi, Jin, R. L, Ge, Ruiqi, Pan, Ruizhe, Xu, Runxin, Li, S. S, Lu, Shanghao, Chen, Shanhuang, Wu, Shaoqing, Ye, Shengfeng, Ma, Shirong, Wang, Shiyu, Zhou, Shuang, Zhou, Shunfeng, Zheng, Size, Wang, T, Yuan, Tian, Zeng, Wangding, An, Wei, Liu, Wen, Liang, Wenfeng, Gao, Wenjun, Zhang, Wentao, Jin, Xiangyue, Wang, Xianzu, Liu, Xiaodong, Wang, Xiaohan, Chen, Xiaokang, Chen, Xiaosha, Nie, Xiaotao, Sun, Xiaowen, Wang, Xiaoxiang, Liu, Xin, Lu, Xuan, Su, Xuecheng, Wu, Y, Li, Y. K, Wei, Y. X, Zhu, Y. X, Xu, Yanhong, Huang, Yanping, Sun, Yaofeng, Wang, Yaohui, Zheng, Yi, Tang, Ying, Piao, Yishi, Dong, Yixin, Liu, Yiyuan, Wang, Yongji, Guo, Yongqiang, Zhu, Yuchen, Wang, Yuduan, Zou, Yuheng, Zha, Yukun, Yan, Yuting, You, Yuxiang, Liu, Yuxuan, Ren, Zehui, Sha, Zhangli, Huang, Zhen, Shao, Zhihong, Wen, Zhiniu, Li, Zhuoshu
Format: Article
Language: eng
Subjects:
Online Access: Order full text
Description
Summary: We present DeepSeek-V2, a strong Mixture-of-Experts (MoE) language model characterized by economical training and efficient inference. It comprises 236B total parameters, of which 21B are activated for each token, and supports a context length of 128K tokens. DeepSeek-V2 adopts innovative architectures including Multi-head Latent Attention (MLA) and DeepSeekMoE. MLA guarantees efficient inference through significantly compressing the Key-Value (KV) cache into a latent vector, while DeepSeekMoE enables training strong models at an economical cost through sparse computation. Compared with DeepSeek 67B, DeepSeek-V2 achieves significantly stronger performance, and meanwhile saves 42.5% of training costs, reduces the KV cache by 93.3%, and boosts the maximum generation throughput to 5.76 times. We pretrain DeepSeek-V2 on a high-quality and multi-source corpus consisting of 8.1T tokens, and further perform Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) to fully unlock its potential. Evaluation results show that, even with only 21B activated parameters, DeepSeek-V2 and its chat versions still achieve top-tier performance among open-source models.
DOI: 10.48550/arxiv.2405.04434
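
The summary above states that Multi-head Latent Attention (MLA) compresses the Key-Value (KV) cache into a latent vector. The following is a minimal, illustrative sketch of that idea only: the module name, the dimensions (d_model=512, d_latent=64), and the omission of details such as rotary embeddings and the paper's decoupled key path are assumptions for illustration, not DeepSeek-V2's actual implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentKVAttention(nn.Module):
    # Illustrative MLA-style attention: cache one small latent per token
    # instead of full per-head keys and values (hypothetical dimensions).
    def __init__(self, d_model=512, n_heads=8, d_latent=64):
        super().__init__()
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.q_proj = nn.Linear(d_model, d_model)
        self.kv_down = nn.Linear(d_model, d_latent)   # compression: only this output is cached
        self.k_up = nn.Linear(d_latent, d_model)      # reconstruct keys from the latent
        self.v_up = nn.Linear(d_latent, d_model)      # reconstruct values from the latent
        self.out_proj = nn.Linear(d_model, d_model)

    def _split(self, z, b):
        # (b, seq, d_model) -> (b, heads, seq, d_head)
        return z.view(b, -1, self.n_heads, self.d_head).transpose(1, 2)

    def forward(self, x, latent_cache=None):
        b, t, d = x.shape
        latent_new = self.kv_down(x)                  # (b, t, d_latent)
        latent = latent_new if latent_cache is None else torch.cat([latent_cache, latent_new], dim=1)
        q = self._split(self.q_proj(x), b)
        k = self._split(self.k_up(latent), b)         # keys/values are re-expanded from the
        v = self._split(self.v_up(latent), b)         # latent on the fly at attention time
        out = F.scaled_dot_product_attention(q, k, v, is_causal=latent_cache is None)
        out = out.transpose(1, 2).reshape(b, t, d)
        return self.out_proj(out), latent             # only the compact latent is returned as cache

# Usage: prefill a prompt, then decode one token while reusing the compact cache.
attn = LatentKVAttention()
prompt = torch.randn(1, 16, 512)
y, cache = attn(prompt)                               # cache shape: (1, 16, 64)
next_tok = torch.randn(1, 1, 512)
y2, cache = attn(next_tok, latent_cache=cache)        # cache grows by d_latent values per token

In this toy setup the cache stores 64 numbers per token rather than 2 x 512 for full keys and values, which illustrates, in spirit, where a large KV-cache reduction such as the 93.3% figure cited in the summary can come from.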