Self-Adaptive Neural Module Transformer for Visual Question Answering

Vision and language understanding is one of the most fundamental and difficult tasks in Multimedia Intelligence. Visual Question Answering (VQA) is even more challenging, since it requires complex reasoning steps to reach the correct answer. To achieve this, the Neural Module Network (NMN) and its variants rely on parsing the natural language question into a module layout (i.e., a problem-solving program). In particular, this process follows a feedforward encoder-decoder pipeline: the encoder embeds the question into a static vector and the decoder generates the layout. However, we argue that such a conventional encoder-decoder neglects the dynamic nature of question comprehension (i.e., we should attend to different words from step to step) and per-module intermediate results (i.e., we should discard modules that perform badly) in the reasoning steps. In this paper, we present a novel NMN, called Self-Adaptive Neural Module Transformer (SANMT), which adaptively adjusts both the question feature encoding and the layout decoding by considering intermediate Q&A results. Specifically, we encode the intermediate results together with the given question features by a novel transformer module to generate a dynamic question feature embedding that evolves over reasoning steps. In addition, the transformer utilizes the intermediate results from each reasoning step to guide the subsequent layout arrangement. Extensive experimental evaluations demonstrate the superiority of the proposed SANMT over NMN and its variants on four challenging benchmarks, including CLEVR, CLEVR-CoGenT, VQAv1.0, and VQAv2.0 (on average, the relative improvements over NMN are 1.5, 2.3, 0.7, and 0.5 accuracy points, respectively).
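The abstract describes the key mechanism: instead of encoding the question into one static vector, a transformer-style module re-reads the question at every reasoning step, conditioned on the intermediate result of the previous step, and the resulting dynamic question embedding drives the choice of the next layout module. The following is a minimal PyTorch sketch of that adaptive decoding loop, written here for illustration only; it is not the authors' implementation, and every class name, dimension, and the choice of a GRU cell plus multi-head cross-attention are assumptions made for concreteness.

```python
# Illustrative sketch of a self-adaptive layout decoder in the spirit of the abstract.
# NOT the authors' released code: names, sizes, and the toy "module vocabulary" are assumed.
import torch
import torch.nn as nn

class AdaptiveLayoutDecoder(nn.Module):
    """At each reasoning step, fuse the question tokens with an embedding of the
    previous step's intermediate result, then score the next layout module."""

    def __init__(self, d_model=256, n_modules=10, n_heads=4):
        super().__init__()
        # Cross-attention: the intermediate-result embedding queries the question tokens,
        # so the question representation can change from step to step.
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.step_rnn = nn.GRUCell(d_model, d_model)       # tracks decoding state
        self.module_head = nn.Linear(d_model, n_modules)   # scores candidate modules
        self.result_proj = nn.Linear(d_model, d_model)     # embeds intermediate results

    def forward(self, question_tokens, n_steps=3):
        # question_tokens: (batch, seq_len, d_model) contextual word features
        batch, _, d_model = question_tokens.shape
        state = question_tokens.mean(dim=1)                         # initial decoder state
        result = question_tokens.new_zeros(batch, d_model)          # no intermediate result yet
        layout_logits = []
        for _ in range(n_steps):
            # Dynamic question embedding: attend to different words depending on the
            # intermediate result produced so far (the "self-adaptive" part).
            query = self.result_proj(result).unsqueeze(1)           # (batch, 1, d_model)
            dyn_q, _ = self.cross_attn(query, question_tokens, question_tokens)
            state = self.step_rnn(dyn_q.squeeze(1), state)
            layout_logits.append(self.module_head(state))
            # In the full system the chosen module would run on the image features and
            # return an attention map / answer feature; here we simply reuse the state.
            result = state
        return torch.stack(layout_logits, dim=1)                    # (batch, n_steps, n_modules)

# Toy usage: 2 questions of 8 tokens each, 256-d features.
decoder = AdaptiveLayoutDecoder()
logits = decoder(torch.randn(2, 8, 256))
print(logits.shape)  # torch.Size([2, 3, 10])
```

In the paper's full pipeline the predicted module is executed on the image features and its output is fed back as the intermediate result; the sketch substitutes the decoder state for that feedback purely to stay self-contained.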

Bibliographic Details
Published in: IEEE Transactions on Multimedia, 2021, Vol. 23, pp. 1264-1273
Main authors: Zhong, Huasong; Chen, Jingyuan; Shen, Chen; Zhang, Hanwang; Huang, Jianqiang; Hua, Xian-Sheng
Format: Article
Language: English
Subjects: Visual question answering; neural module transformer; self-adaptive; multi modal; Natural language processing; Neural networks; Encoders-Decoders; Reasoning; Knowledge discovery; Cognition; Multimedia; Task analysis; Visualization; Layout; Modules; Decoding; Transformers; Questions; Coders
Online access: Order full text
DOI: 10.1109/TMM.2020.2995278
ISSN: 1520-9210
EISSN: 1941-0077
Publisher: IEEE (Piscataway)