Object-difference drived graph convolutional networks for visual question answering

Visual Question Answering (VQA), an important task for evaluating the cross-modal understanding capability of an artificial intelligence model, has been a hot research topic in both the computer vision and natural language processing communities. Recently, graph-based models have received growing interest i...

Detailed Description

Saved in:
Bibliographic Details
Published in: Multimedia tools and applications, 2021-05, Vol.80 (11), p.16247-16265
Main authors: Zhu, Xi, Mao, Zhendong, Chen, Zhineng, Li, Yangyang, Wang, Zhaohui, Wang, Bin
Format: Article
Language: eng
Subjects:
Online access: Full text
container_end_page 16265
container_issue 11
container_start_page 16247
container_title Multimedia tools and applications
container_volume 80
creator Zhu, Xi
Mao, Zhendong
Chen, Zhineng
Li, Yangyang
Wang, Zhaohui
Wang, Bin
description Visual Question Answering (VQA), an important task for evaluating the cross-modal understanding capability of an artificial intelligence model, has been a hot research topic in both the computer vision and natural language processing communities. Recently, graph-based models have received growing interest in VQA for their potential to model the relationships between objects as well as their strong interpretability. Nonetheless, those solutions mainly define the similarity between objects as their semantic relationship, while largely ignoring the critical point that the difference between objects can provide more information for establishing relationships between nodes in the graph. To exploit this difference, we propose an object-difference based graph learner, which learns question-adaptive semantic relations by calculating inter-object differences under the guidance of the question. With the learned relationships, the input image can be represented as an object graph encoded with structural dependencies between objects. In addition, existing graph-based models leverage object boxes pre-extracted by an object detection model as node features for convenience, but they suffer from object redundancy. To reduce the redundant objects, we introduce a soft-attention mechanism that magnifies the question-related objects. Moreover, we incorporate our object-difference based graph learner into soft-attention based Graph Convolutional Networks to capture question-specific objects and their interactions for answer prediction. Our experimental results on the VQA 2.0 dataset demonstrate that our model gives significantly better performance than baseline methods.
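The abstract describes two mechanisms: a graph learner that builds a question-adaptive adjacency matrix from pairwise object-feature differences, and a soft-attention step that down-weights question-irrelevant regions before graph convolution. The following is a minimal PyTorch sketch of that pipeline; the module names, dimensions, and the exact way the question embedding is fused are assumptions made for illustration, not the authors' released implementation.

```python
# Hypothetical sketch of an object-difference graph learner plus one GCN step.
# Shapes, fusion operator, and hyper-parameters are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ObjectDifferenceGraphLearner(nn.Module):
    """Builds a question-adaptive adjacency matrix from pairwise object differences."""
    def __init__(self, obj_dim: int, q_dim: int, hidden_dim: int = 512):
        super().__init__()
        self.diff_proj = nn.Linear(obj_dim, hidden_dim)   # projects object-feature differences
        self.q_proj = nn.Linear(q_dim, hidden_dim)        # projects the question embedding
        self.edge_score = nn.Linear(hidden_dim, 1)        # scores each candidate edge

    def forward(self, obj_feats: torch.Tensor, q_feat: torch.Tensor) -> torch.Tensor:
        # obj_feats: (B, N, obj_dim) region features; q_feat: (B, q_dim) question embedding
        diff = obj_feats.unsqueeze(2) - obj_feats.unsqueeze(1)            # (B, N, N, obj_dim)
        guided = torch.tanh(self.diff_proj(diff) *
                            self.q_proj(q_feat)[:, None, None, :])        # question-guided edges
        adj = F.softmax(self.edge_score(guided).squeeze(-1), dim=-1)      # row-normalised (B, N, N)
        return adj

class GCNLayer(nn.Module):
    """One graph-convolution step over the learned object graph."""
    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)

    def forward(self, x: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # Aggregate neighbour features with the learned adjacency, then transform.
        return F.relu(self.linear(torch.bmm(adj, x)))

class QuestionGuidedAttention(nn.Module):
    """Soft attention over objects to magnify question-related regions."""
    def __init__(self, obj_dim: int, q_dim: int):
        super().__init__()
        self.att_proj = nn.Linear(obj_dim, q_dim)

    def forward(self, obj_feats: torch.Tensor, q_feat: torch.Tensor) -> torch.Tensor:
        scores = (self.att_proj(obj_feats) * q_feat.unsqueeze(1)).sum(-1)  # (B, N)
        return obj_feats * F.softmax(scores, dim=-1).unsqueeze(-1)         # re-weighted objects

# Example usage with illustrative shapes (36 detected regions per image):
#   obj_feats = torch.randn(8, 36, 2048)
#   q_feat    = torch.randn(8, 1024)
#   obj_feats = QuestionGuidedAttention(2048, 1024)(obj_feats, q_feat)
#   adj       = ObjectDifferenceGraphLearner(2048, 1024)(obj_feats, q_feat)
#   graph_out = GCNLayer(2048, 1024)(obj_feats, adj)   # fed to an answer classifier
```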
doi_str_mv 10.1007/s11042-020-08790-0
format Article
fulltext fulltext
identifier ISSN: 1380-7501
ispartof Multimedia tools and applications, 2021-05, Vol.80 (11), p.16247-16265
issn 1380-7501
1573-7721
language eng
recordid cdi_proquest_journals_2529006818
source SpringerLink Journals - AutoHoldings
subjects Artificial intelligence
Artificial neural networks
Computer Communication Networks
Computer Science
Computer vision
Critical point
Data Structures and Information Theory
Datasets
Feature extraction
Graphical representations
Language
Multimedia
Multimedia Information Systems
Natural language processing
Object recognition
Questions
Redundancy
Semantics
Special Purpose and Application-Based Systems
title Object-difference drived graph convolutional networks for visual question answering
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2024-12-18T21%3A21%3A09IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_cross&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Object-difference%20drived%20graph%20convolutional%20networks%20for%20visual%20question%20answering&rft.jtitle=Multimedia%20tools%20and%20applications&rft.au=Zhu,%20Xi&rft.date=2021-05-01&rft.volume=80&rft.issue=11&rft.spage=16247&rft.epage=16265&rft.pages=16247-16265&rft.issn=1380-7501&rft.eissn=1573-7721&rft_id=info:doi/10.1007/s11042-020-08790-0&rft_dat=%3Cproquest_cross%3E2529006818%3C/proquest_cross%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2529006818&rft_id=info:pmid/&rfr_iscdi=true