On the Impact of Cross-Domain Data on German Language Models
Traditionally, large language models have been either trained on general web crawls or domain-specific data. However, recent successes of generative large language models, have shed light on the benefits of cross-domain datasets. To examine the significance of prioritizing data diversity over qualit...
Gespeichert in:
Hauptverfasser: | , , , , , , , , , , , , , , |
---|---|
Format: | Artikel |
Sprache: | eng |
Schlagworte: | |
Online-Zugang: | Volltext bestellen |
Tags: |
Tag hinzufügen
Keine Tags, Fügen Sie den ersten Tag hinzu!
|
Zusammenfassung: | Traditionally, large language models have been either trained on general web
crawls or domain-specific data. However, recent successes of generative large
language models, have shed light on the benefits of cross-domain datasets. To
examine the significance of prioritizing data diversity over quality, we
present a German dataset comprising texts from five domains, along with another
dataset aimed at containing high-quality data. Through training a series of
models ranging between 122M and 750M parameters on both datasets, we conduct a
comprehensive benchmark on multiple downstream tasks. Our findings demonstrate
that the models trained on the cross-domain dataset outperform those trained on
quality data alone, leading to improvements up to $4.45\%$ over the previous
state-of-the-art. The models are available at
https://huggingface.co/ikim-uk-essen |
---|---|
DOI: | 10.48550/arxiv.2310.07321 |