Acquiring Linguistic Knowledge from Multimodal Input
Saved in:
Main author: | , |
---|---|
Format: | Article |
Language: | English |
Subjects: | |
Abstract: | In contrast to children, language models (LMs) exhibit considerably inferior data efficiency when acquiring language. In this submission to the BabyLM Challenge (Warstadt et al., 2023), we test the hypothesis that this data efficiency gap is partly caused by a lack of multimodal input and grounding in the learning environment of typical language models. Although previous work looking into this question found that multimodal training can even harm language-only performance, we speculate that these findings can be attributed to catastrophic forgetting of complex language due to fine-tuning on captions data. To test our hypothesis, we perform an ablation study on FLAVA (Singh et al., 2022), a multimodal vision-and-language model, independently varying the volume of text and vision input to quantify how much text data (if any) can be offset by vision at different data scales. We aim to limit catastrophic forgetting through a multitask pretraining regime that includes unimodal text-only tasks and data sampled from WiT, the relatively diverse Wikipedia-based dataset (Srinivasan et al., 2021). Our results are largely negative: multimodal pretraining does not harm our models' language performance, but neither does it consistently help. That said, our conclusions are limited by the small number of runs we were able to conduct. While we must leave open the possibility that multimodal input explains some of the gap in data efficiency between LMs and humans, positive evidence for this hypothesis will require better architectures and techniques for multimodal training. |
---|---|
DOI: | 10.48550/arxiv.2402.17936 |