Global and Local Knowledge-Aware Attention Network for Action Recognition

Convolutional neural networks (CNNs) have shown an effective way to learn spatiotemporal representation for action recognition in videos. However, most traditional action recognition algorithms do not employ the attention mechanism to focus on essential parts of video frames that are relevant to the action. In this article, we propose a novel global and local knowledge-aware attention network to address this challenge for action recognition. The proposed network incorporates two types of attention mechanisms, called statistic-based attention (SA) and learning-based attention (LA), to attach higher importance to the crucial elements in each video frame. As global pooling (GP) models capture global information, while attention models focus on the significant details to make full use of their implicit complementary advantages, our network adopts a three-stream architecture, including two attention streams and a GP stream. Each attention stream employs a fusion layer to combine global and local information and produces composite features. Furthermore, global-attention (GA) regularization is proposed to guide the two attention streams to better model the dynamics of composite features with reference to the global information. Fusion at the softmax layer is adopted to make better use of the implicit complementary advantages between the SA, LA, and GP streams and to obtain the final comprehensive predictions. The proposed network is trained in an end-to-end fashion and learns efficient video-level features both spatially and temporally. Extensive experiments are conducted on three challenging benchmarks, Kinetics, HMDB51, and UCF101, and experimental results demonstrate that the proposed network outperforms most state-of-the-art methods.
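The abstract lays out a concrete architecture, so a short sketch may help the reader see how the pieces fit together. The following is a minimal PyTorch sketch of the three-stream idea under stated assumptions: per-frame region features are already extracted by a CNN backbone, the SA statistic is taken to be the L2 norm of each region, LA is a single learned scoring layer, GA regularization is written as an MSE pull toward the global feature, and softmax-level fusion is plain averaging. None of these specifics come from the paper itself; they only illustrate the structure the abstract describes.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ThreeStreamNet(nn.Module):
    """Illustrative three-stream network: statistic-based attention (SA),
    learning-based attention (LA), and global pooling (GP), with each
    attention stream fusing its local feature with the global one and the
    three predictions combined at the softmax level."""

    def __init__(self, in_dim: int, num_classes: int):
        super().__init__()
        self.la_score = nn.Linear(in_dim, 1)            # LA: learned region scores
        self.sa_head = nn.Linear(2 * in_dim, num_classes)
        self.la_head = nn.Linear(2 * in_dim, num_classes)
        self.gp_head = nn.Linear(in_dim, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, regions, in_dim) CNN features for one video frame.
        gp = x.mean(dim=1)                              # GP stream: global pooling

        # SA stream: weights from a fixed feature statistic (assumed: L2 norm).
        sa_w = F.softmax(x.norm(dim=2), dim=1)          # (batch, regions)
        sa = torch.einsum('br,brd->bd', sa_w, x)

        # LA stream: weights from a learned scoring layer.
        la_w = F.softmax(self.la_score(x).squeeze(-1), dim=1)
        la = torch.einsum('br,brd->bd', la_w, x)

        # Fusion layer in each attention stream: concatenate local and
        # global information into a composite feature.
        sa_probs = F.softmax(self.sa_head(torch.cat([sa, gp], dim=1)), dim=1)
        la_probs = F.softmax(self.la_head(torch.cat([la, gp], dim=1)), dim=1)
        gp_probs = F.softmax(self.gp_head(gp), dim=1)

        # Softmax-level fusion of the three streams' predictions.
        return (sa_probs + la_probs + gp_probs) / 3.0

def ga_regularizer(attended: torch.Tensor, global_feat: torch.Tensor) -> torch.Tensor:
    """Assumed form of global-attention (GA) regularization: keep each
    attended local feature close to the global reference."""
    return F.mse_loss(attended, global_feat)

# Quick shape check with hypothetical sizes: a 7x7 feature map flattened
# to 49 regions of dimension 512, classified over 101 classes.
net = ThreeStreamNet(in_dim=512, num_classes=101)
probs = net(torch.randn(4, 49, 512))                    # probs: (4, 101), rows sum to 1

In training, the GA term would be added to the classification loss with some weight, which matches the guidance toward the global information that the abstract attributes to GA regularization.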

Bibliographic Details
Published in: IEEE Transactions on Neural Networks and Learning Systems, 2021-01, Vol. 32 (1), pp. 334-347
Main Authors: Zheng, Zhenxing; An, Gaoyun; Wu, Dapeng; Ruan, Qiuqi
Format: Article
Language: English
DOI: 10.1109/TNNLS.2020.2978613
ISSN: 2162-237X
EISSN: 2162-2388
CODEN: ITNNAL
PMID: 32224465
Publisher: IEEE, United States
Online Access: Request full text

Subjects:
Action recognition
Algorithms
Artificial neural networks
Attention
attention mechanism
Benchmarking
Benchmarks
Biological system modeling
Computer Systems
convolutional neural networks-recurrent neural networks (CNNs-RNNs) framework
Data models
Databases, Factual
Feature extraction
Humans
Image Processing, Computer-Assisted
Information science
Knowledge
Machine Learning
Movement
Neural networks
Neural Networks, Computer
Pattern Recognition, Automated - methods
Recognition
Regularization
Reproducibility of Results
spatiotemporal feature
Spatiotemporal phenomena
Streams
Task analysis
Videos