PaLI: A Jointly-Scaled Multilingual Language-Image Model
Format: Article
Language: eng
Online access: Order full text
Abstract: Effective scaling and a flexible task interface enable large language models to excel at many tasks. We present PaLI (Pathways Language and Image model), a model that extends this approach to the joint modeling of language and vision. PaLI generates text based on visual and textual inputs, and with this interface performs many vision, language, and multimodal tasks, in many languages. To train PaLI, we make use of large pre-trained encoder-decoder language models and Vision Transformers (ViTs). This allows us to capitalize on their existing capabilities and leverage the substantial cost of training them. We find that joint scaling of the vision and language components is important. Since existing Transformers for language are much larger than their vision counterparts, we train a large, 4-billion parameter ViT (ViT-e) to quantify the benefits from even larger-capacity vision models. To train PaLI, we create a large multilingual mix of pretraining tasks, based on a new image-text training set containing 10B images and texts in over 100 languages. PaLI achieves state-of-the-art in multiple vision and language tasks (such as captioning, visual question-answering, scene-text understanding), while retaining a simple, modular, and scalable design.
DOI: 10.48550/arxiv.2209.06794
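
To make the interface described in the abstract concrete, the sketch below traces the PaLI-style data flow: an image is split into patches and encoded into visual tokens by a ViT, those tokens are combined with embedded text-prompt tokens, and an encoder-decoder model generates output text conditioned on both. Every name and dimension here is a hypothetical toy stand-in (random projections in place of real transformer blocks), not the paper's implementation, which uses the 4B-parameter ViT-e and a large pre-trained encoder-decoder language model.

```python
import numpy as np

# Toy dimensions, chosen only for illustration; the real PaLI components
# are multi-billion-parameter transformers with the same overall data flow.
rng = np.random.default_rng(0)
D = 64          # shared model width (hypothetical)
PATCH = 16      # ViT patch size
VOCAB = 1000    # toy vocabulary size

def vit_encode(image):
    """Stand-in for the ViT image encoder: patchify + linear projection.
    A real ViT would apply transformer blocks on top of these embeddings."""
    h, w, c = image.shape
    patches = image.reshape(h // PATCH, PATCH, w // PATCH, PATCH, c)
    patches = patches.transpose(0, 2, 1, 3, 4).reshape(-1, PATCH * PATCH * c)
    w_proj = rng.normal(size=(patches.shape[1], D))
    return patches @ w_proj          # one "visual token" per image patch

def embed_text(token_ids):
    """Stand-in for the language model's token embedding table."""
    table = rng.normal(size=(VOCAB, D))
    return table[token_ids]

def pali_generate(image, prompt_ids, max_len=5):
    """PaLI-style interface: visual tokens and prompt tokens are fed
    together into an encoder-decoder that emits text tokens. Here the
    'encoder' is a mean-pool and the 'decoder' a single random
    projection, purely to show the plumbing end to end."""
    enc_input = np.concatenate([vit_encode(image), embed_text(prompt_ids)])
    context = enc_input.mean(axis=0)              # toy "encoder" summary
    w_out = rng.normal(size=(D, VOCAB))
    out = []
    for _ in range(max_len):                      # greedy toy "decoder"
        tok = int(np.argmax(context @ w_out))
        out.append(tok)
        context = context + embed_text([tok])[0]  # feed the token back in
    return out

# Usage: a 224x224 RGB image plus a textual prompt yields text tokens.
image = rng.random((224, 224, 3))
prompt = [1, 42, 7]  # hypothetical token ids, e.g. "describe the image"
print(pali_generate(image, prompt))
```

The point the sketch is meant to convey is the abstract's "flexible task interface": captioning, VQA, and other tasks all reduce to the same operation of generating text conditioned on an (image, text) input pair, which is what lets one model cover many tasks and languages.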