BERTtime Stories: Investigating the Role of Synthetic Story Data in Language Pre-training
Main authors: | , , , , |
---|---|
Format: | Article |
Language: | eng |
Subjects: | |
Online access: | Order full text |
Abstract: | We describe our contribution to the Strict and Strict-Small tracks of the 2nd iteration of the BabyLM Challenge. The shared task is centered around efficient pre-training given data constraints motivated by human development. In response, we study the effect of synthetic story data in language pre-training using TinyStories: a recently introduced dataset of short stories. Initially, we train GPT-Neo models on subsets of TinyStories, while varying the amount of available data. We find that, even with access to less than 100M words, the models are able to generate high-quality, original completions to a given story, and acquire substantial linguistic knowledge. To measure the effect of synthetic story data, we train LTG-BERT encoder models on a combined dataset of: a subset of TinyStories, story completions generated by GPT-Neo, and a subset of the BabyLM dataset. Our experimentation reveals that synthetic data can occasionally offer modest gains, but overall has a negative influence on linguistic understanding. Our work offers an initial study on synthesizing story data in low-resource settings and underscores its potential for augmentation in data-constrained language modeling. We publicly release our models and implementation on our GitHub. |
---|---|
DOI: | 10.48550/arxiv.2410.15365 |
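
The abstract describes training GPT-Neo models on TinyStories and using them to generate story completions that serve as synthetic pre-training data. Purely as an illustration, the following is a minimal sketch of how such a completion could be produced with the Hugging Face `transformers` and `datasets` libraries; the model checkpoint, dataset slice, and generation settings below are placeholders, not the configuration used by the authors.

```python
# Minimal sketch: complete the opening of a TinyStories story with a GPT-Neo model.
# The checkpoint and sampling parameters are illustrative assumptions, not the
# paper's own trained models or decoding setup.
from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import load_dataset

MODEL_NAME = "EleutherAI/gpt-neo-125m"  # placeholder; the paper trains its own GPT-Neo variants

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

# Use the first sentences of a TinyStories story as the prompt to be completed.
stories = load_dataset("roneneldan/TinyStories", split="train[:1]")
prompt = ".".join(stories[0]["text"].split(".")[:2]) + "."

inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_new_tokens=100,
    do_sample=True,   # sample a completion instead of greedy decoding
    top_p=0.95,
    temperature=0.8,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

In the study, completions generated this way are combined with a TinyStories subset and BabyLM text to form the corpus on which the LTG-BERT encoder models are then trained.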