YuLan: An Open-source Large Language Model
Format: Article
Language: English
Abstract: Large language models (LLMs) have become the foundation of many applications,
leveraging their extensive capabilities in processing and understanding natural
language. While many open-source LLMs have been released with technical
reports, the lack of training details hinders further research and development.
This paper presents the development of YuLan, a series of open-source LLMs with
$12$ billion parameters. The base model of YuLan is pre-trained on
approximately $1.7$T tokens derived from a diverse corpus, including massive
English, Chinese, and multilingual texts. We design a three-stage pre-training
method to enhance YuLan's overall capabilities. Subsequent phases of training
incorporate instruction-tuning and human alignment, employing a substantial
volume of high-quality synthesized data. To facilitate the learning of complex
and long-tail knowledge, we devise a curriculum-learning framework across
these stages, which helps LLMs learn knowledge in an easy-to-hard manner.
YuLan's training was completed in January 2024, and the model achieves performance
on par with state-of-the-art LLMs across various English and Chinese
benchmarks. This paper outlines a comprehensive technical roadmap for
developing LLMs from scratch. Our model and codes are available at
https://github.com/RUC-GSAI/YuLan-Chat.
DOI: 10.48550/arxiv.2406.19853
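The abstract mentions a curriculum-learning framework that lets the model learn knowledge in an easy-to-hard manner across the training stages, but this record gives no implementation details. The snippet below is a minimal, hypothetical sketch of such easy-to-hard data ordering; the `difficulty_score` proxy, the `Example` class, and the even three-stage split are illustrative assumptions and not YuLan's actual pipeline.

```python
# Hypothetical sketch of easy-to-hard curriculum ordering for training data.
# Not YuLan's implementation; the difficulty proxy and stage split are assumptions.
from dataclasses import dataclass
from typing import Iterator, List


@dataclass
class Example:
    text: str


def difficulty_score(ex: Example) -> float:
    """Crude difficulty proxy: longer texts with longer average words score higher."""
    words = ex.text.split()
    if not words:
        return 0.0
    avg_word_len = sum(len(w) for w in words) / len(words)
    return len(words) * avg_word_len


def curriculum_stages(examples: List[Example], num_stages: int = 3) -> Iterator[List[Example]]:
    """Yield the corpus in easy-to-hard stages by splitting the sorted list evenly."""
    ordered = sorted(examples, key=difficulty_score)
    stage_size = max(1, len(ordered) // num_stages)
    for i in range(num_stages):
        start = i * stage_size
        end = len(ordered) if i == num_stages - 1 else (i + 1) * stage_size
        yield ordered[start:end]


if __name__ == "__main__":
    corpus = [
        Example("The cat sat."),
        Example("LLMs learn from large corpora of text."),
        Example("Quantum chromodynamics describes the strong interaction between quarks and gluons."),
    ]
    for stage_id, batch in enumerate(curriculum_stages(corpus), start=1):
        print(f"Stage {stage_id}: {[ex.text[:30] for ex in batch]}")
```

In a real pipeline the difficulty signal and the schedule would be tied to the three pre-training stages and data mixtures described in the paper itself.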