Multimodal Graph Transformer for Multimodal Question Answering

Despite the success of Transformer models in vision and language tasks, they often learn knowledge from enormous data implicitly and cannot utilize structured input data directly. On the other hand, structured learning approaches such as graph neural networks (GNNs) that integrate prior information...

Detailed description

Saved in:
Bibliographic details
Main authors: He, Xuehai; Wang, Xin Eric
Format: Article
Language: eng
Subjects:
Online access: Order full text
creator He, Xuehai
Wang, Xin Eric
description Despite the success of Transformer models in vision and language tasks, they often learn knowledge from enormous data implicitly and cannot utilize structured input data directly. On the other hand, structured learning approaches such as graph neural networks (GNNs) that integrate prior information can barely compete with Transformer models. In this work, we aim to benefit from both worlds and propose a novel Multimodal Graph Transformer for question answering tasks that require reasoning across multiple modalities. We introduce a graph-involved plug-and-play quasi-attention mechanism to incorporate multimodal graph information, acquired from text and visual data, into the vanilla self-attention as an effective prior. In particular, we construct the text graph, dense region graph, and semantic graph to generate adjacency matrices, and then compose them with input vision and language features to perform downstream reasoning. Such a way of regularizing self-attention with graph information significantly improves the inference ability and helps align features from different modalities. We validate the effectiveness of the Multimodal Graph Transformer over its Transformer baselines on the GQA, VQAv2, and MultiModalQA datasets.
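The description only states that multimodal graph adjacency matrices are composed with vision and language features to regularize self-attention; the exact quasi-attention formula is not given in this record. The sketch below is a minimal, hypothetical illustration of one common way to fold such an adjacency matrix into scaled dot-product attention as a hard prior. The function name graph_masked_attention and the masking scheme are illustrative assumptions, not the paper's actual implementation.

# Hedged sketch (PyTorch): graph-masked self-attention, assuming the
# multimodal graph adjacency matrix acts as a hard attention prior.
import torch
import torch.nn.functional as F

def graph_masked_attention(q, k, v, adjacency, neg_inf=-1e9):
    # q, k, v: (batch, seq, dim); adjacency: (batch, seq, seq), 1 where the
    # multimodal graph (text / dense-region / semantic) permits attention.
    d = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d ** 0.5            # vanilla self-attention logits
    scores = scores.masked_fill(adjacency == 0, neg_inf)   # graph prior suppresses non-edges
    weights = F.softmax(scores, dim=-1)
    return weights @ v

# Toy usage: one batch, two fully connected tokens, feature dim 4.
q = k = v = torch.randn(1, 2, 4)
adj = torch.ones(1, 2, 2)
out = graph_masked_attention(q, k, v, adj)                 # shape: (1, 2, 4)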
doi_str_mv 10.48550/arxiv.2305.00581
format Article
fulltext fulltext_linktorsrc
identifier DOI: 10.48550/arxiv.2305.00581
language eng
recordid cdi_arxiv_primary_2305_00581
source arXiv.org
subjects Computer Science - Artificial Intelligence
Computer Science - Computation and Language
Computer Science - Computer Vision and Pattern Recognition
title Multimodal Graph Transformer for Multimodal Question Answering