Global and Local Knowledge-Aware Attention Network for Action Recognition
Convolutional neural networks (CNNs) have proven an effective way to learn spatiotemporal representations for action recognition in videos. However, most traditional action recognition algorithms do not employ an attention mechanism to focus on the parts of video frames that are relevant to the action. In this article, we propose a novel global and local knowledge-aware attention network to address this challenge for action recognition. The proposed network incorporates two types of attention mechanisms, statistic-based attention (SA) and learning-based attention (LA), to attach higher importance to the crucial elements in each video frame. Since global pooling (GP) models capture global information while attention models focus on significant details, our network adopts a three-stream architecture, comprising two attention streams and a GP stream, to make full use of their implicit complementary advantages. Each attention stream employs a fusion layer to combine global and local information and produce composite features. Furthermore, global-attention (GA) regularization is proposed to guide the two attention streams to better model the dynamics of composite features with reference to the global information. Fusion at the softmax layer is adopted to further exploit the complementary advantages among the SA, LA, and GP streams and obtain the final comprehensive predictions. The proposed network is trained end to end and learns efficient video-level features both spatially and temporally. Extensive experiments on three challenging benchmarks, Kinetics, HMDB51, and UCF101, demonstrate that the proposed network outperforms most state-of-the-art methods.
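The abstract describes attention-weighted pooling of frame features, a fusion layer that combines the attention descriptor with a global pooling (GP) descriptor, GA regularization, and softmax-level fusion across streams. Below is a minimal PyTorch sketch of those ideas; all module names, shapes, and the concrete forms of the attention, fusion, and regularization terms are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionPool(nn.Module):
    """Pool a (B, C, H, W) feature map into (B, C) with learned spatial attention."""
    def __init__(self, channels: int):
        super().__init__()
        self.score = nn.Conv2d(channels, 1, kernel_size=1)  # one score per spatial location

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        b, c, h, w = feat.shape
        attn = F.softmax(self.score(feat).view(b, 1, h * w), dim=-1)  # weights over H*W
        return (attn * feat.view(b, c, h * w)).sum(dim=-1)            # weighted sum -> (B, C)

class AttentionStream(nn.Module):
    """One attention stream: fuse the global (GP) and local (attention) descriptors."""
    def __init__(self, channels: int, num_classes: int):
        super().__init__()
        self.attn_pool = AttentionPool(channels)
        self.fuse = nn.Linear(2 * channels, channels)  # concat-then-project fusion layer
        self.classifier = nn.Linear(channels, num_classes)

    def forward(self, feat: torch.Tensor):
        gp = feat.mean(dim=(2, 3))            # global average pooling -> (B, C)
        local = self.attn_pool(feat)          # attention-weighted pooling -> (B, C)
        composite = F.relu(self.fuse(torch.cat([gp, local], dim=1)))
        return self.classifier(composite), composite, gp

def ga_regularization(composite: torch.Tensor, gp: torch.Tensor) -> torch.Tensor:
    """One plausible form of GA regularization: keep the composite features close to
    the global descriptor (an assumption; the paper defines its own formulation)."""
    return F.mse_loss(composite, gp)

def softmax_fusion(logits_list):
    """Softmax-level fusion: average the per-stream class distributions."""
    return torch.stack([F.softmax(l, dim=1) for l in logits_list]).mean(dim=0)

# Usage: two attention streams (stand-ins for SA and LA) plus a plain GP stream.
feat = torch.randn(2, 256, 7, 7)              # backbone features for 2 clips
sa, la = AttentionStream(256, 101), AttentionStream(256, 101)
gp_head = nn.Linear(256, 101)
sa_logits, sa_comp, sa_gp = sa(feat)
la_logits, _, _ = la(feat)
gp_logits = gp_head(feat.mean(dim=(2, 3)))
probs = softmax_fusion([sa_logits, la_logits, gp_logits])  # (2, 101) class probabilities
loss_reg = ga_regularization(sa_comp, sa_gp)               # added to the classification loss
```

In the paper, SA derives its weights from feature statistics while LA learns them; here both stand-ins use the same learned scoring purely to keep the sketch short.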
Published in: | IEEE Transactions on Neural Networks and Learning Systems, 2021-01, Vol. 32 (1), pp. 334-347 |
---|---|
Main authors: | Zheng, Zhenxing; An, Gaoyun; Wu, Dapeng; Ruan, Qiuqi |
Format: | Article |
Language: | English |
Subjects: | Action recognition; Algorithms; Artificial neural networks; Attention; attention mechanism; Benchmarking; Benchmarks; Biological system modeling; Computer Systems; convolutional neural networks-recurrent neural networks (CNNs-RNNs) framework; Data models; Databases, Factual; Feature extraction; Humans; Image Processing, Computer-Assisted; Information science; Knowledge; Machine Learning; Movement; Neural networks; Neural Networks, Computer; Pattern Recognition, Automated - methods; Recognition; Regularization; Reproducibility of Results; spatiotemporal feature; Spatiotemporal phenomena; Streams; Task analysis; Videos |
DOI: | 10.1109/TNNLS.2020.2978613 |
ISSN: | 2162-237X |
EISSN: | 2162-2388 |
PMID: | 32224465 |
Publisher: | IEEE (United States) |
Source: | IEEE Electronic Library (IEL) |
Online access: | Order full text |