A Better Way to Attend: Attention With Trees for Video Question Answering

We propose a new attention model for video question answering. The main idea of attention models is to locate the most informative parts of the visual data. Attention mechanisms are now widely used. However, most existing visual attention mechanisms regard the question as a whole...

Bibliographic Details
Published in:	IEEE transactions on image processing 2018-11, Vol.27 (11), p.5563-5574
Main Authors:	Xue, Hongyang; Chu, Wenqing; Zhao, Zhou; Cai, Deng
Format:	Article
Language:	English
Subjects:	attention model; Computational modeling; Knowledge discovery; Natural languages; scene understanding; Semantics; Sentences; Syntactics; Task analysis; Trees; Video question answering; Visualization

container_end_page	5574
container_issue	11
container_start_page	5563
container_title	IEEE transactions on image processing
container_volume	27
creator	Xue, Hongyang; Chu, Wenqing; Zhao, Zhou; Cai, Deng
description	We propose a new attention model for video question answering. The main idea of attention models is to locate the most informative parts of the visual data. Attention mechanisms are now widely used. However, most existing visual attention mechanisms regard the question as a whole. They ignore word-level semantics, where each word can receive different attention and some words need no attention. Nor do they consider the semantic structure of the sentences. Although the extended soft attention model for video question answering leverages word-level attention, it performs poorly on long question sentences. In this paper, we propose the heterogeneous tree-structured memory network (HTreeMN) for video question answering. Our approach is based on the syntax parse trees of the question sentences. The HTreeMN treats words differently: visual words are processed with an attention module and verbal ones are not. It also exploits the semantic structure of the sentences by combining neighbors according to the recursive structure of the parse trees. The understanding of the words and the videos is propagated and merged from the leaves to the root. Furthermore, we build a hierarchical attention mechanism to distill the attended features. We evaluate our approach on two data sets. The experimental results show that HTreeMN outperforms the other attention models, especially on complex questions.
doi_str_mv	10.1109/TIP.2018.2859820
format	Article
coden	IIPRE4
orcidid	0000-0003-0816-7975; 0000-0003-3161-3566
pmid	30047882
publisher	United States: IEEE
rights	Copyright The Institute of Electrical and Electronics Engineers, Inc. (IEEE) 2018
tpages	12
fulltext	fulltext_linktorsrc
identifier	ISSN: 1057-7149
ispartof	IEEE transactions on image processing, 2018-11, Vol.27 (11), p.5563-5574
issn	1057-7149 1941-0042
language	eng
recordid	cdi_crossref_primary_10_1109_TIP_2018_2859820
source	IEEE Electronic Library (IEL)
subjects	attention model; Computational modeling; Knowledge discovery; Natural languages; scene understanding; Semantics; Sentences; Syntactics; Task analysis; Trees; Video question answering; Visualization
title	A Better Way to Attend: Attention With Trees for Video Question Answering
url	https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-02-14T07%3A18%3A02IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_RIE&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=A%20Better%20Way%20to%20Attend:%20Attention%20With%20Trees%20for%20Video%20Question%20Answering&rft.jtitle=IEEE%20transactions%20on%20image%20processing&rft.au=Xue,%20Hongyang&rft.date=2018-11-01&rft.volume=27&rft.issue=11&rft.spage=5563&rft.epage=5574&rft.pages=5563-5574&rft.issn=1057-7149&rft.eissn=1941-0042&rft.coden=IIPRE4&rft_id=info:doi/10.1109/TIP.2018.2859820&rft_dat=%3Cproquest_RIE%3E2076912838%3C/proquest_RIE%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2088947892&rft_id=info:pmid/30047882&rft_ieee_id=8419716&rfr_iscdi=true
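
Illustrative note: the abstract describes words being treated differently at the leaves of a parse tree (visual words attend to the video, verbal words do not) and representations being merged from the leaves to the root. The following is a minimal sketch of that idea only, not the authors' HTreeMN. The names Node, is_visual, soft_attention and encode are hypothetical, and the dot-product attention, the element-wise sum at visual leaves and the mean used to merge children are all assumptions chosen for simplicity. The hierarchical attention step mentioned in the abstract is not modelled here.

```python
import math
from dataclasses import dataclass, field
from typing import List, Optional

Vector = List[float]


@dataclass
class Node:
    """Parse-tree node (hypothetical): leaves carry a word vector, internal nodes carry children."""
    word_vec: Optional[Vector] = None  # set on leaves only
    is_visual: bool = False            # assumed flag: does the word refer to video content?
    children: List["Node"] = field(default_factory=list)


def soft_attention(query: Vector, frames: List[Vector]) -> Vector:
    """Dot-product soft attention (an assumption): softmax-weighted average of frame features."""
    scores = [sum(q * f for q, f in zip(query, frame)) for frame in frames]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(frames[0])
    return [sum(w * frame[i] for w, frame in zip(weights, frames)) / total
            for i in range(dim)]


def encode(node: Node, frames: List[Vector]) -> Vector:
    """Propagate representations bottom-up, from the leaves to the root."""
    if not node.children:
        if node.is_visual:
            # Visual word: add the attended video feature to the word vector.
            attended = soft_attention(node.word_vec, frames)
            return [w + a for w, a in zip(node.word_vec, attended)]
        # Verbal word: no attention, pass the word vector through unchanged.
        return list(node.word_vec)
    # Internal node: merge child representations by averaging (an assumption).
    child_vecs = [encode(child, frames) for child in node.children]
    return [sum(col) / len(child_vecs) for col in zip(*child_vecs)]


if __name__ == "__main__":
    frames = [[1.0, 0.0], [0.0, 1.0]]  # two toy frame features
    tree = Node(children=[
        Node(word_vec=[0.0, 0.0], is_visual=False),  # verbal leaf
        Node(word_vec=[0.0, 0.0], is_visual=True),   # visual leaf
    ])
    print(encode(tree, frames))
```

In a learned model the merge and attention steps would have trainable parameters; the fixed mean and dot product here only make the leaf-to-root data flow visible.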