Multimodal Graph Transformer for Multimodal Question Answering

Despite the success of Transformer models in vision and language tasks, they often learn knowledge from enormous data implicitly and cannot utilize structured input data directly. On the other hand, structured learning approaches such as graph neural networks (GNNs) that integrate prior information...

Detailed description

Saved in:
Bibliographic details
Main authors: He, Xuehai; Wang, Xin Eric
Format: Article
Language: eng
Subjects:
Online access: Order full text
creator He, Xuehai
Wang, Xin Eric
description Despite the success of Transformer models in vision and language tasks, they often learn knowledge from enormous data implicitly and cannot utilize structured input data directly. On the other hand, structured learning approaches such as graph neural networks (GNNs) that integrate prior information can barely compete with Transformer models. In this work, we aim to benefit from both worlds and propose a novel Multimodal Graph Transformer for question answering tasks that require reasoning across multiple modalities. We introduce a graph-involved plug-and-play quasi-attention mechanism to incorporate multimodal graph information, acquired from text and visual data, into the vanilla self-attention as an effective prior. In particular, we construct the text graph, dense region graph, and semantic graph to generate adjacency matrices, and then compose them with input vision and language features to perform downstream reasoning. Such a way of regularizing self-attention with graph information significantly improves the inference ability and helps align features from different modalities. We validate the effectiveness of the Multimodal Graph Transformer over its Transformer baselines on the GQA, VQAv2, and MultiModalQA datasets.
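The description only states that multimodal graph adjacency matrices are composed with vision and language features to regularize self-attention; the exact quasi-attention formula is not given in this record. The sketch below is a minimal, hypothetical illustration of one common way to fold such an adjacency matrix into scaled dot-product attention as a hard prior. The function name graph_masked_attention and the masking scheme are illustrative assumptions, not the paper's actual implementation.

# Hedged sketch (PyTorch): graph-masked self-attention, assuming the
# multimodal graph adjacency matrix acts as a hard attention prior.
import torch
import torch.nn.functional as F

def graph_masked_attention(q, k, v, adjacency, neg_inf=-1e9):
    # q, k, v: (batch, seq, dim); adjacency: (batch, seq, seq), 1 where the
    # multimodal graph (text / dense-region / semantic) permits attention.
    d = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d ** 0.5            # vanilla self-attention logits
    scores = scores.masked_fill(adjacency == 0, neg_inf)   # graph prior suppresses non-edges
    weights = F.softmax(scores, dim=-1)
    return weights @ v

# Toy usage: one batch, two fully connected tokens, feature dim 4.
q = k = v = torch.randn(1, 2, 4)
adj = torch.ones(1, 2, 2)
out = graph_masked_attention(q, k, v, adj)                 # shape: (1, 2, 4)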
doi_str_mv 10.48550/arxiv.2305.00581
format Article
fulltext fulltext_linktorsrc
identifier DOI: 10.48550/arxiv.2305.00581
language eng
recordid cdi_arxiv_primary_2305_00581
source arXiv.org
subjects Computer Science - Artificial Intelligence
Computer Science - Computation and Language
Computer Science - Computer Vision and Pattern Recognition
title Multimodal Graph Transformer for Multimodal Question Answering