Towards building multilingual language model for medicine

Bibliographic details
Published in:	Nature communications 2024-09, Vol.15 (1), p.8384-15, Article 8384
Main authors:	Qiu, Pengcheng; Wu, Chaoyi; Zhang, Xiaoman; Lin, Weixiong; Wang, Haicheng; Zhang, Ya; Wang, Yanfeng; Xie, Weidi
Format:	Article
Language:	eng
Subjects:	631/114/1305; 692/700/228; 692/700/478; Benchmarking; Benchmarks; Humanities and Social Sciences; Humans; Language; Large language models; multidisciplinary; Multilingualism; Natural Language Processing; Science; Science (multidisciplinary)
Online access:	Full text

Description
Abstract:	The development of open-source, multilingual medical language models can benefit a wide, linguistically diverse audience from different regions. To promote this domain, we make the following contributions: First, we construct a multilingual medical corpus, termed MMedC, containing approximately 25.5B tokens across 6 main languages, enabling auto-regressive domain adaptation for general LLMs; Second, to monitor the development of multilingual medical LLMs, we propose a multilingual medical multiple-choice question-answering benchmark with rationale, termed MMedBench; Third, we have assessed a number of open-source large language models (LLMs) on our benchmark, along with those further trained auto-regressively on MMedC. Our final model, MMed-Llama 3, with only 8B parameters, achieves superior performance compared to all other open-source models on both MMedBench and English benchmarks, even rivaling GPT-4. In conclusion, we present a large-scale corpus, a benchmark and a series of models to support the development of multilingual medical LLMs. Open-source, multilingual medical LLMs can benefit a wide audience from different regions. Here, the authors openly present a large-scale corpus, a benchmark, and a series of LLMs to promote development in this field.
ISSN:	2041-1723
DOI:	10.1038/s41467-024-52417-z