Self-Adaptive Neural Module Transformer for Visual Question Answering
Published in: | IEEE transactions on multimedia, 2021, Vol.23, p.1264-1273 |
---|---|
Main Authors: | Zhong, Huasong; Chen, Jingyuan; Shen, Chen; Zhang, Hanwang; Huang, Jianqiang; Hua, Xian-Sheng |
Format: | Article |
Language: | English |
Subjects: | |
container_end_page | 1273 |
---|---|
container_issue | |
container_start_page | 1264 |
container_title | IEEE transactions on multimedia |
container_volume | 23 |
creator | Zhong, Huasong; Chen, Jingyuan; Shen, Chen; Zhang, Hanwang; Huang, Jianqiang; Hua, Xian-Sheng |
description | Vision and language understanding is one of the most fundamental and difficult tasks in Multimedia Intelligence. Visual Question Answering (VQA) is even more challenging, since it requires complex reasoning steps to reach the correct answer. To achieve this, Neural Module Network (NMN) and its variants rely on parsing the natural language question into a module layout (i.e., a problem-solving program). In particular, this process follows a feedforward encoder-decoder pipeline: the encoder embeds the question into a static vector and the decoder generates the layout. However, we argue that such a conventional encoder-decoder neglects both the dynamic nature of question comprehension (i.e., we should attend to different words from step to step) and the per-module intermediate results (i.e., we should discard modules that perform badly) in the reasoning steps. In this paper, we present a novel NMN, called Self-Adaptive Neural Module Transformer (SANMT), which adaptively adjusts both the question feature encoding and the layout decoding by considering intermediate Q&A results. Specifically, we encode the intermediate results together with the given question features by a novel transformer module to generate a dynamic question feature embedding that evolves over the reasoning steps. In addition, the transformer utilizes the intermediate results from each reasoning step to guide the subsequent layout arrangement. Extensive experimental evaluations demonstrate the superiority of the proposed SANMT over NMN and its variants on four challenging benchmarks, including CLEVR, CLEVR-CoGenT, VQAv1.0, and VQAv2.0 (the relative improvements over NMN are, on average, 1.5, 2.3, 0.7, and 0.5 accuracy points, respectively). (An illustrative sketch of this reasoning loop follows the record fields below.) |
doi_str_mv | 10.1109/TMM.2020.2995278 |
format | Article |
publisher | Piscataway: IEEE |
coden | ITMUF8 |
ieee_id | 9095237 |
fulltext | fulltext_linktorsrc |
identifier | ISSN: 1520-9210 |
ispartof | IEEE transactions on multimedia, 2021, Vol.23, p.1264-1273 |
issn | 1520-9210; 1941-0077 |
language | eng |
recordid | cdi_proquest_journals_2519084011 |
source | IEEE Electronic Library (IEL) |
subjects | Coders; Cognition; Decoding; Encoders-Decoders; Knowledge discovery; Layout; Layouts; Modules; multi modal; Multimedia; Natural language processing; neural module transformer; Neural networks; Questions; Reasoning; self-adaptive; Task analysis; Transformers; Visual question answering; Visualization |
title | Self-Adaptive Neural Module Transformer for Visual Question Answering |
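The description field above explains the core SANMT mechanism only in prose: at each reasoning step a transformer re-encodes the question tokens together with the previous step's intermediate result, and the resulting dynamic question embedding decides which module is laid out next. The sketch below is a minimal, hypothetical illustration of that loop, not the authors' implementation; all class names, tensor dimensions, and the toy module outputs are assumptions made for readability.

```python
# Hypothetical sketch (not the authors' released code): one self-adaptive
# reasoning step. A transformer layer jointly re-encodes the question tokens
# and the previous intermediate result; the updated encoding is then used to
# pick the next module in the layout.
import torch
import torch.nn as nn

class AdaptiveStep(nn.Module):
    def __init__(self, d_model=128, n_modules=8, n_heads=4):
        super().__init__()
        self.encoder = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, batch_first=True)
        self.module_head = nn.Linear(d_model, n_modules)  # layout-token logits

    def forward(self, question_tokens, intermediate):
        # question_tokens: (B, T, d); intermediate result: (B, d)
        seq = torch.cat([intermediate.unsqueeze(1), question_tokens], dim=1)
        seq = self.encoder(seq)               # dynamic question encoding
        logits = self.module_head(seq[:, 0])  # read off the result slot
        return seq[:, 1:], logits             # evolved tokens, module choice

# Toy usage: three reasoning steps over a dummy batch of two questions.
step = AdaptiveStep()
q = torch.randn(2, 10, 128)       # 10 question tokens, 128-dim features
result = torch.zeros(2, 128)      # "null" intermediate result before step 0
for t in range(3):
    q, logits = step(q, result)
    chosen = logits.argmax(dim=-1)
    result = torch.randn(2, 128)  # stand-in for the chosen module's output
    print(f"step {t}: chosen module ids {chosen.tolist()}")
```

In a full model, the stand-in random tensor would be replaced by the output of the module actually selected at each step, and the sequence of chosen modules would form the layout that answers the question.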