A Novel Pretrained General-purpose Vision Language Model for the Vietnamese Language

Lying in the cross-section of computer vision and natural language processing, vision language models are capable of processing images and text at once. These models are helpful in various tasks: text generation from image and vice versa, image-text retrieval, or visual navigation. Besides building...

Ausführliche Beschreibung

Gespeichert in:
Bibliographische Detailangaben
Veröffentlicht in:ACM transactions on Asian and low-resource language information processing 2024-05, Vol.23 (5), p.1-16, Article 66
Hauptverfasser: Vu, Dinh Anh, Pham, Quang Nhat Minh, Tran, Giang Son
Format: Artikel
Sprache:eng
Schlagworte:
Online-Zugang:Volltext
Tags: Tag hinzufügen
Keine Tags, Fügen Sie den ersten Tag hinzu!
Beschreibung
Zusammenfassung:Lying in the cross-section of computer vision and natural language processing, vision language models are capable of processing images and text at once. These models are helpful in various tasks: text generation from image and vice versa, image-text retrieval, or visual navigation. Besides building a model trained on a dataset for a task, people also study general-purpose models to utilize many datasets for multitasks. Their two primary applications are image captioning and visual question answering. For English, large datasets and foundation models are already abundant. However, for Vietnamese, they are still limited. To expand the language range, this work proposes a pretrained general-purpose image-text model named VisualRoBERTa. A dataset of 600k images with captions (translated MS COCO 2017 from English to Vietnamese) is introduced to pretrain VisualRoBERTa. The model’s architecture is built using Convolutional Neural Network and Transformer blocks. Fine-tuning VisualRoBERTa shows promising results on the ViVQA dataset with 34.49% accuracy, 0.4173 BLEU 4, and 0.4390 RougeL (in visual question answering task), and best outcomes on the sViIC dataset with 0.6685 BLEU 4, 0.6320 RougeL (in image captioning task).
ISSN:2375-4699
2375-4702
DOI:10.1145/3654796