Global and Local Knowledge-Aware Attention Network for Action Recognition
Convolutional neural networks (CNNs) have proven an effective way to learn spatiotemporal representations for action recognition in videos. However, most traditional action recognition algorithms do not employ an attention mechanism to focus on the parts of video frames that are relevant to the action. In this article, we propose a novel global and local knowledge-aware attention network to address this challenge for action recognition. The proposed network incorporates two types of attention mechanisms, statistic-based attention (SA) and learning-based attention (LA), to attach higher importance to the crucial elements in each video frame. Since global pooling (GP) models capture global information while attention models focus on significant details, our network adopts a three-stream architecture, comprising two attention streams and a GP stream, to make full use of their implicit complementary advantages. Each attention stream employs a fusion layer to combine global and local information and produce composite features. Furthermore, global-attention (GA) regularization is proposed to guide the two attention streams to better model the dynamics of composite features with reference to the global information. Fusion at the softmax layer is adopted to further exploit the complementary advantages among the SA, LA, and GP streams and obtain the final comprehensive predictions. The proposed network is trained end to end and learns efficient video-level features both spatially and temporally. Extensive experiments on three challenging benchmarks, Kinetics, HMDB51, and UCF101, demonstrate that the proposed network outperforms most state-of-the-art methods.
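The abstract describes attention-weighted pooling of frame features, a fusion layer that combines the attention descriptor with a global pooling (GP) descriptor, GA regularization, and softmax-level fusion across streams. Below is a minimal PyTorch sketch of those ideas; all module names, shapes, and the concrete forms of the attention, fusion, and regularization terms are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionPool(nn.Module):
    """Pool a (B, C, H, W) feature map into (B, C) with learned spatial attention."""
    def __init__(self, channels: int):
        super().__init__()
        self.score = nn.Conv2d(channels, 1, kernel_size=1)  # one score per spatial location

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        b, c, h, w = feat.shape
        attn = F.softmax(self.score(feat).view(b, 1, h * w), dim=-1)  # weights over H*W
        return (attn * feat.view(b, c, h * w)).sum(dim=-1)            # weighted sum -> (B, C)

class AttentionStream(nn.Module):
    """One attention stream: fuse the global (GP) and local (attention) descriptors."""
    def __init__(self, channels: int, num_classes: int):
        super().__init__()
        self.attn_pool = AttentionPool(channels)
        self.fuse = nn.Linear(2 * channels, channels)  # concat-then-project fusion layer
        self.classifier = nn.Linear(channels, num_classes)

    def forward(self, feat: torch.Tensor):
        gp = feat.mean(dim=(2, 3))            # global average pooling -> (B, C)
        local = self.attn_pool(feat)          # attention-weighted pooling -> (B, C)
        composite = F.relu(self.fuse(torch.cat([gp, local], dim=1)))
        return self.classifier(composite), composite, gp

def ga_regularization(composite: torch.Tensor, gp: torch.Tensor) -> torch.Tensor:
    """One plausible form of GA regularization: keep the composite features close to
    the global descriptor (an assumption; the paper defines its own formulation)."""
    return F.mse_loss(composite, gp)

def softmax_fusion(logits_list):
    """Softmax-level fusion: average the per-stream class distributions."""
    return torch.stack([F.softmax(l, dim=1) for l in logits_list]).mean(dim=0)

# Usage: two attention streams (stand-ins for SA and LA) plus a plain GP stream.
feat = torch.randn(2, 256, 7, 7)              # backbone features for 2 clips
sa, la = AttentionStream(256, 101), AttentionStream(256, 101)
gp_head = nn.Linear(256, 101)
sa_logits, sa_comp, sa_gp = sa(feat)
la_logits, _, _ = la(feat)
gp_logits = gp_head(feat.mean(dim=(2, 3)))
probs = softmax_fusion([sa_logits, la_logits, gp_logits])  # (2, 101) class probabilities
loss_reg = ga_regularization(sa_comp, sa_gp)               # added to the classification loss
```

In the paper, SA derives its weights from feature statistics while LA learns them; here both stand-ins use the same learned scoring purely to keep the sketch short.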
Published in: | IEEE Transactions on Neural Networks and Learning Systems, 2021-01, Vol. 32 (1), pp. 334-347 |
---|---|
Main authors: | Zheng, Zhenxing; An, Gaoyun; Wu, Dapeng; Ruan, Qiuqi |
Format: | Article |
Language: | English |
Subjects: | Action recognition; Algorithms; Artificial neural networks; Attention; attention mechanism; Benchmarking; Benchmarks; Biological system modeling; Computer Systems; convolutional neural networks-recurrent neural networks (CNNs-RNNs) framework; Data models; Databases, Factual; Feature extraction; Humans; Image Processing, Computer-Assisted; Information science; Knowledge; Machine Learning; Movement; Neural networks; Neural Networks, Computer; Pattern Recognition, Automated - methods; Recognition; Regularization; Reproducibility of Results; spatiotemporal feature; Spatiotemporal phenomena; Streams; Task analysis; Videos |
DOI: | 10.1109/TNNLS.2020.2978613 |
ISSN: | 2162-237X |
EISSN: | 2162-2388 |
PMID: | 32224465 |
Publisher: | IEEE (United States) |
Source: | IEEE Electronic Library (IEL) |
Online access: | Order full text |