From Babble to Words: Pre-Training Language Models on Continuous Streams of Phonemes
Format: Article
Language: English
Abstract: Language models are typically trained on large corpora of text in their default orthographic form. However, this is not the only option; representing data as streams of phonemes can offer unique advantages, from deeper insights into phonological language acquisition to improved performance on sound-based tasks. The challenge lies in evaluating the impact of phoneme-based training, as most benchmarks are also orthographic. To address this, we develop a pipeline to convert text datasets into a continuous stream of phonemes. We apply this pipeline to the 100-million-word pre-training dataset from the BabyLM challenge, as well as to standard language and grammatical benchmarks, enabling us to pre-train and evaluate a model using phonemic input representations. Our results show that while phoneme-based training slightly reduces performance on traditional language understanding tasks, it offers valuable analytical and practical benefits.
DOI: 10.48550/arxiv.2410.22906
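
The abstract describes a pipeline that converts orthographic text into a continuous stream of phonemes. The sketch below is a minimal, hypothetical illustration of what such a conversion step might look like, assuming the open-source `phonemizer` package with an installed espeak-ng backend; it is not the authors' actual pipeline, whose details are given in the paper.

```python
# Minimal sketch of a text-to-phoneme conversion step.
# Assumption: the open-source `phonemizer` package and an espeak-ng backend
# are available; this illustrates the general idea, not the paper's pipeline.
from phonemizer import phonemize
from phonemizer.separator import Separator


def to_phoneme_stream(sentences):
    """Convert a list of orthographic sentences into one continuous,
    space-separated stream of IPA phonemes with word boundaries removed."""
    phonemized = phonemize(
        sentences,
        language="en-us",
        backend="espeak",
        separator=Separator(phone=" ", word=""),  # keep phones, drop word gaps
        strip=True,
        preserve_punctuation=False,
    )
    # Join all sentences into a single continuous phoneme stream.
    return " ".join(phonemized)


if __name__ == "__main__":
    print(to_phoneme_stream(["the cat sat on the mat", "language models babble"]))
```

A stream produced this way can then be tokenized and fed to a standard language-model pre-training loop in place of the original orthographic text; the same conversion would need to be applied to any evaluation benchmarks so that training and evaluation share the phonemic representation.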