Modelling Multimodal Integration in Human Concept Processing with Vision-and-Language Models
Main authors: | , , , |
---|---|
Format: | Article |
Language: | eng |
Subjects: | |
Online access: | Order full text |
Summary: | Representations from deep neural networks (DNNs) have proven remarkably
predictive of neural activity involved in both visual and linguistic
processing. Despite these successes, most studies to date concern unimodal
DNNs, encoding either visual or textual input but not both. Yet, there is
growing evidence that human meaning representations integrate linguistic and
sensory-motor information. Here we investigate whether the integration of
multimodal information operated by current vision-and-language DNN models
(VLMs) leads to representations that are more aligned with human brain activity
than those obtained by language-only and vision-only DNNs. We focus on fMRI
responses recorded while participants read concept words in the context of
either a full sentence or an accompanying picture. Our results reveal that VLM
representations correlate more strongly than language- and vision-only DNNs
with activations in brain areas functionally related to language processing. A
comparison between different types of visuo-linguistic architectures shows that
recent generative VLMs tend to be less brain-aligned than previous
architectures with lower performance on downstream applications. Moreover,
through an additional analysis comparing brain vs. behavioural alignment across
multiple VLMs, we show that -- with one remarkable exception -- representations
that strongly align with behavioural judgments do not correlate highly with
brain responses. This indicates that brain similarity does not go hand in hand
with behavioural similarity, and vice versa. |
DOI: | 10.48550/arxiv.2407.17914 |