Mix-tower: Light visual question answering framework based on exclusive self-attention mechanism
Saved in:

| Published in: | Neurocomputing (Amsterdam) 2024-06, Vol. 587, p. 127686, Article 127686 |
|---|---|
| Main authors: | , , , |
| Format: | Article |
| Language: | English |
| Subjects: | |
| Online access: | Full text |
Abstract: Visual question answering (VQA) holds the potential to enhance artificial intelligence proficiency in understanding natural language, stimulate advances in computer vision technologies, and expand the range of practical applications. In the current VQA domain, single-tower architectures suffer from very large parameter counts, whereas dual-tower architectures face challenges from insufficient cross-modal data interaction. To address these issues, we propose the novel Mix-Tower model, a simple, lightweight yet effective VQA model. Our model uses the self-attention-based Transformer as its base unit. In the pre-training phase, we train the model with only 1.15M data samples. In addition, we analyze the effect of three factors on model performance: the number of layers in the Transformer, the number of layers in the FeedForward Networks, and the combination of different classes of features. For downstream tasks, the OK-VQA and COCO-QA datasets are used for performance validation. Experimental results show that our model outperforms the best-known baselines by 7.61% and 7.89% on the OK-VQA and COCO-QA datasets, respectively. Meanwhile, our smallest model has only 35M parameters, significantly fewer than the other baseline models. These results demonstrate that our model achieves a lightweight design while maintaining superior performance. Our code is available at: https://github.com/jianruichen/MixTower.
Highlights:

- The Mix-Tower model combines the strengths of both single-tower and dual-tower models.
- The same model architecture is used to process different types of data.
- A lightweight FFN module has been incorporated into the Transformer.
- Compared to other baseline models, our model achieves superior performance.
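The abstract and highlights describe a Transformer block whose FeedForward Network (FFN) is made lightweight to shrink the parameter count. The paper itself does not specify the layer sizes here, so the following is only a minimal illustrative sketch (not the authors' implementation): a single-head self-attention sub-layer followed by an FFN whose hidden width is *smaller* than the model width, the opposite of the usual 4x expansion. All dimensions and parameter names are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, w_q, w_k, w_v):
    # Single-head scaled dot-product self-attention.
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])
    return softmax(scores) @ v

def transformer_block(x, p):
    # Attention sub-layer with a residual connection.
    h = x + self_attention(x, p["w_q"], p["w_k"], p["w_v"])
    # "Lightweight" FFN: one narrow ReLU hidden layer (d_ffn < d)
    # instead of the conventional 4x expansion, cutting parameters.
    ffn = np.maximum(h @ p["w1"], 0.0) @ p["w2"]
    return h + ffn

rng = np.random.default_rng(0)
d, d_ffn, seq = 8, 4, 5  # d_ffn < d is the lightweight choice (illustrative sizes)
params = {
    "w_q": rng.normal(size=(d, d)), "w_k": rng.normal(size=(d, d)),
    "w_v": rng.normal(size=(d, d)),
    "w1": rng.normal(size=(d, d_ffn)), "w2": rng.normal(size=(d_ffn, d)),
}
x = rng.normal(size=(seq, d))   # e.g. token or image-patch embeddings
y = transformer_block(x, params)
print(y.shape)                  # (5, 8): sequence length and width preserved
```

Because the block maps a (seq, d) input to a (seq, d) output, the same unit can be stacked and reused for both the question text and the image features, which is consistent with the highlight that one architecture processes different types of data.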
| ISSN: | 0925-2312 1872-8286 |
|---|---|
| DOI: | 10.1016/j.neucom.2024.127686 |