Self-Adaptive Neural Module Transformer for Visual Question Answering

Vision and language understanding is one of the most fundamental and difficult tasks in Multimedia Intelligence. Visual Question Answering (VQA) is even more challenging, since it requires complex reasoning steps to reach the correct answer. To achieve this, the Neural Module Network (NMN) and its variants rely on parsing the natural language question into a module layout (i.e., a problem-solving program). In particular, this process follows a feedforward encoder-decoder pipeline: the encoder embeds the question into a static vector and the decoder generates the layout. However, we argue that such a conventional encoder-decoder neglects the dynamic nature of question comprehension (i.e., we should attend to different words from step to step) and per-module intermediate results (i.e., we should discard modules that perform badly) in the reasoning steps. In this paper, we present a novel NMN, called Self-Adaptive Neural Module Transformer (SANMT), which adaptively adjusts both the question feature encoding and the layout decoding by considering intermediate Q&A results. Specifically, we encode the intermediate results together with the given question features by a novel transformer module to generate a dynamic question feature embedding that evolves over reasoning steps. In addition, the transformer utilizes the intermediate results from each reasoning step to guide the subsequent layout arrangement. Extensive experimental evaluations demonstrate the superiority of the proposed SANMT over NMN and its variants on four challenging benchmarks, including CLEVR, CLEVR-CoGenT, VQAv1.0, and VQAv2.0 (on average, the relative improvements over NMN are 1.5, 2.3, 0.7, and 0.5 accuracy points, respectively).
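The abstract describes the key mechanism: instead of encoding the question into one static vector, a transformer-style module re-reads the question at every reasoning step, conditioned on the intermediate result of the previous step, and the resulting dynamic question embedding drives the choice of the next layout module. The following is a minimal PyTorch sketch of that adaptive decoding loop, written here for illustration only; it is not the authors' implementation, and every class name, dimension, and the choice of a GRU cell plus multi-head cross-attention are assumptions made for concreteness.

```python
# Illustrative sketch of a self-adaptive layout decoder in the spirit of the abstract.
# NOT the authors' released code: names, sizes, and the toy "module vocabulary" are assumed.
import torch
import torch.nn as nn

class AdaptiveLayoutDecoder(nn.Module):
    """At each reasoning step, fuse the question tokens with an embedding of the
    previous step's intermediate result, then score the next layout module."""

    def __init__(self, d_model=256, n_modules=10, n_heads=4):
        super().__init__()
        # Cross-attention: the intermediate-result embedding queries the question tokens,
        # so the question representation can change from step to step.
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.step_rnn = nn.GRUCell(d_model, d_model)       # tracks decoding state
        self.module_head = nn.Linear(d_model, n_modules)   # scores candidate modules
        self.result_proj = nn.Linear(d_model, d_model)     # embeds intermediate results

    def forward(self, question_tokens, n_steps=3):
        # question_tokens: (batch, seq_len, d_model) contextual word features
        batch, _, d_model = question_tokens.shape
        state = question_tokens.mean(dim=1)                         # initial decoder state
        result = question_tokens.new_zeros(batch, d_model)          # no intermediate result yet
        layout_logits = []
        for _ in range(n_steps):
            # Dynamic question embedding: attend to different words depending on the
            # intermediate result produced so far (the "self-adaptive" part).
            query = self.result_proj(result).unsqueeze(1)           # (batch, 1, d_model)
            dyn_q, _ = self.cross_attn(query, question_tokens, question_tokens)
            state = self.step_rnn(dyn_q.squeeze(1), state)
            layout_logits.append(self.module_head(state))
            # In the full system the chosen module would run on the image features and
            # return an attention map / answer feature; here we simply reuse the state.
            result = state
        return torch.stack(layout_logits, dim=1)                    # (batch, n_steps, n_modules)

# Toy usage: 2 questions of 8 tokens each, 256-d features.
decoder = AdaptiveLayoutDecoder()
logits = decoder(torch.randn(2, 8, 256))
print(logits.shape)  # torch.Size([2, 3, 10])
```

In the paper's full pipeline the predicted module is executed on the image features and its output is fed back as the intermediate result; the sketch substitutes the decoder state for that feedback purely to stay self-contained.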

Bibliographic Details
Published in: IEEE Transactions on Multimedia, 2021, Vol. 23, pp. 1264-1273
Main authors: Zhong, Huasong; Chen, Jingyuan; Shen, Chen; Zhang, Hanwang; Huang, Jianqiang; Hua, Xian-Sheng
Format: Article
Language: English
Subjects: Visual question answering; neural module transformer; self-adaptive; multi modal; Natural language processing; Neural networks; Encoders-Decoders; Reasoning; Knowledge discovery; Cognition; Multimedia; Task analysis; Visualization; Layout; Modules; Decoding; Transformers; Questions; Coders
Online access: Order full text
DOI: 10.1109/TMM.2020.2995278
ISSN: 1520-9210
EISSN: 1941-0077
Publisher: IEEE (Piscataway)