An archival perspective on pretraining data
Saved in:
Published in: | Patterns (New York, N.Y.), 2024-04, Vol. 5 (4), p. 100966, Article 100966 |
Main authors: | , , , |
Format: | Article |
Language: | eng |
Online access: | Full text |
Abstract: | Alongside an explosion in research and development related to large language models, there has been a concomitant rise in the creation of pretraining datasets—massive collections of text, typically scraped from the web. Drawing on the field of archival studies, we analyze pretraining datasets as informal archives—heterogeneous collections of diverse material that mediate access to knowledge. We use this framework to identify impacts of pretraining data creation and use beyond directly shaping model behavior and reveal how choices about what is included in pretraining data necessarily involve subjective decisions about values. In doing so, the archival perspective helps us identify opportunities for researchers who study the social impacts of technology to contribute to confronting the challenges and trade-offs that arise in creating pretraining datasets at this scale.
Large language models have become ubiquitous but depend crucially on the data on which they are trained. These pretraining datasets are themselves distinctive artifacts that are reused, built upon, and made legitimate beyond their role in shaping model outputs. We consider the similarities between pretraining datasets and archives: both are collections of diverse sociocultural materials that mediate knowledge production and thereby confer power to those who select, document, and control access to them. We discuss the limitations of current approaches to assembling pretraining datasets and ask: Whose voices are amplified or obscured? Who is harmed? Whose perspectives are taken up or assumed as the default? We highlight the need for more research on these datasets and the practices through which they are built, and we suggest possible paths forward, drawing on ideas from archival studies.
Large language models depend crucially on the data they are trained on. The authors consider how these pretraining datasets, like archives, are diverse sociocultural collections that mediate knowledge production. They highlight the need for more research on these datasets and draw on ideas from archival studies to suggest possible paths forward for researchers who study the social impacts of technology.
ISSN: | 2666-3899 |
DOI: | 10.1016/j.patter.2024.100966 |