Object-difference drived graph convolutional networks for visual question answering

Visual Question Answering (VQA), an important task for evaluating the cross-modal understanding capability of an artificial intelligence model, has been a hot research topic in both the computer vision and natural language processing communities. Recently, graph-based models have received growing interest i...

Detailed Description

Saved in:
Bibliographic Details
Published in: Multimedia tools and applications, 2021-05, Vol.80 (11), p.16247-16265
Main authors: Zhu, Xi, Mao, Zhendong, Chen, Zhineng, Li, Yangyang, Wang, Zhaohui, Wang, Bin
Format: Article
Language: eng
Subjects:
Online access: Full text
container_end_page 16265
container_issue 11
container_start_page 16247
container_title Multimedia tools and applications
container_volume 80
creator Zhu, Xi
Mao, Zhendong
Chen, Zhineng
Li, Yangyang
Wang, Zhaohui
Wang, Bin
description Visual Question Answering (VQA), an important task for evaluating the cross-modal understanding capability of an artificial intelligence model, has been a hot research topic in both the computer vision and natural language processing communities. Recently, graph-based models have received growing interest in VQA for their potential to model the relationships between objects as well as their strong interpretability. Nonetheless, those solutions mainly define the similarity between objects as their semantic relationship, while largely ignoring the critical point that the difference between objects can provide more information for establishing relationships between nodes in the graph. To exploit this difference, we propose an object-difference based graph learner, which learns question-adaptive semantic relations by calculating inter-object differences under the guidance of the question. With the learned relationships, the input image can be represented as an object graph encoded with structural dependencies between objects. In addition, existing graph-based models leverage object boxes pre-extracted by an object detection model as node features for convenience, but they suffer from object redundancy. To reduce the redundant objects, we introduce a soft-attention mechanism that magnifies the question-related objects. Moreover, we incorporate our object-difference based graph learner into soft-attention based Graph Convolutional Networks to capture question-specific objects and their interactions for answer prediction. Our experimental results on the VQA 2.0 dataset demonstrate that our model gives significantly better performance than baseline methods.
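The abstract describes two mechanisms: a graph learner that builds a question-adaptive adjacency matrix from pairwise object-feature differences, and a soft-attention step that down-weights question-irrelevant regions before graph convolution. The following is a minimal PyTorch sketch of that pipeline; the module names, dimensions, and the exact way the question embedding is fused are assumptions made for illustration, not the authors' released implementation.

```python
# Hypothetical sketch of an object-difference graph learner plus one GCN step.
# Shapes, fusion operator, and hyper-parameters are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ObjectDifferenceGraphLearner(nn.Module):
    """Builds a question-adaptive adjacency matrix from pairwise object differences."""
    def __init__(self, obj_dim: int, q_dim: int, hidden_dim: int = 512):
        super().__init__()
        self.diff_proj = nn.Linear(obj_dim, hidden_dim)   # projects object-feature differences
        self.q_proj = nn.Linear(q_dim, hidden_dim)        # projects the question embedding
        self.edge_score = nn.Linear(hidden_dim, 1)        # scores each candidate edge

    def forward(self, obj_feats: torch.Tensor, q_feat: torch.Tensor) -> torch.Tensor:
        # obj_feats: (B, N, obj_dim) region features; q_feat: (B, q_dim) question embedding
        diff = obj_feats.unsqueeze(2) - obj_feats.unsqueeze(1)            # (B, N, N, obj_dim)
        guided = torch.tanh(self.diff_proj(diff) *
                            self.q_proj(q_feat)[:, None, None, :])        # question-guided edges
        adj = F.softmax(self.edge_score(guided).squeeze(-1), dim=-1)      # row-normalised (B, N, N)
        return adj

class GCNLayer(nn.Module):
    """One graph-convolution step over the learned object graph."""
    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)

    def forward(self, x: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # Aggregate neighbour features with the learned adjacency, then transform.
        return F.relu(self.linear(torch.bmm(adj, x)))

class QuestionGuidedAttention(nn.Module):
    """Soft attention over objects to magnify question-related regions."""
    def __init__(self, obj_dim: int, q_dim: int):
        super().__init__()
        self.att_proj = nn.Linear(obj_dim, q_dim)

    def forward(self, obj_feats: torch.Tensor, q_feat: torch.Tensor) -> torch.Tensor:
        scores = (self.att_proj(obj_feats) * q_feat.unsqueeze(1)).sum(-1)  # (B, N)
        return obj_feats * F.softmax(scores, dim=-1).unsqueeze(-1)         # re-weighted objects

# Example usage with illustrative shapes (36 detected regions per image):
#   obj_feats = torch.randn(8, 36, 2048)
#   q_feat    = torch.randn(8, 1024)
#   obj_feats = QuestionGuidedAttention(2048, 1024)(obj_feats, q_feat)
#   adj       = ObjectDifferenceGraphLearner(2048, 1024)(obj_feats, q_feat)
#   graph_out = GCNLayer(2048, 1024)(obj_feats, adj)   # fed to an answer classifier
```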
doi_str_mv 10.1007/s11042-020-08790-0
format Article
fulltext fulltext
identifier ISSN: 1380-7501
ispartof Multimedia tools and applications, 2021-05, Vol.80 (11), p.16247-16265
issn 1380-7501
1573-7721
language eng
recordid cdi_proquest_journals_2529006818
source SpringerLink Journals - AutoHoldings
subjects Artificial intelligence
Artificial neural networks
Computer Communication Networks
Computer Science
Computer vision
Critical point
Data Structures and Information Theory
Datasets
Feature extraction
Graphical representations
Language
Multimedia
Multimedia Information Systems
Natural language processing
Object recognition
Questions
Redundancy
Semantics
Special Purpose and Application-Based Systems
title Object-difference drived graph convolutional networks for visual question answering
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2024-12-18T21%3A21%3A09IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_cross&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Object-difference%20drived%20graph%20convolutional%20networks%20for%20visual%20question%20answering&rft.jtitle=Multimedia%20tools%20and%20applications&rft.au=Zhu,%20Xi&rft.date=2021-05-01&rft.volume=80&rft.issue=11&rft.spage=16247&rft.epage=16265&rft.pages=16247-16265&rft.issn=1380-7501&rft.eissn=1573-7721&rft_id=info:doi/10.1007/s11042-020-08790-0&rft_dat=%3Cproquest_cross%3E2529006818%3C/proquest_cross%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2529006818&rft_id=info:pmid/&rfr_iscdi=true