2-D Skeleton-Based Action Recognition via Two-Branch Stacked LSTM-RNNs

Action recognition in video sequences is an interesting field for many computer vision applications, including behavior analysis, event recognition, and video surveillance. In this article, a method based on 2D skeleton and two-branch stacked Recurrent Neural Networks (RNNs) with Long Short-Term Mem...

Detailed description

Bibliographic details
Published in: IEEE transactions on multimedia 2020-10, Vol.22 (10), p.2481-2496
Main authors: Avola, Danilo; Cascio, Marco; Cinque, Luigi; Foresti, Gian Luca; Massaroni, Cristiano; Rodola, Emanuele
Format: Article
Language: eng
Subjects:
Online access: Order full text
container_end_page 2496
container_issue 10
container_start_page 2481
container_title IEEE transactions on multimedia
container_volume 22
creator Avola, Danilo
Cascio, Marco
Cinque, Luigi
Foresti, Gian Luca
Massaroni, Cristiano
Rodola, Emanuele
description Action recognition in video sequences is an interesting field for many computer vision applications, including behavior analysis, event recognition, and video surveillance. In this article, a method based on 2D skeleton and two-branch stacked Recurrent Neural Networks (RNNs) with Long Short-Term Memory (LSTM) cells is proposed. Unlike 3D skeletons, usually generated by RGB-D cameras, the 2D skeletons adopted in this article are reconstructed starting from RGB video streams, therefore allowing the use of the proposed approach in both indoor and outdoor environments. Moreover, any case of missing skeletal data is managed by exploiting 3D-Convolutional Neural Networks (3D-CNNs). Comparative experiments with several key works on KTH and Weizmann datasets show that the method described in this paper outperforms the current state-of-the-art. Additional experiments on UCF Sports and IXMAS datasets demonstrate the effectiveness of our method in the presence of noisy data and perspective changes, respectively. Further investigations on UCF Sports, HMDB51, UCF101, and Kinetics400 highlight how the combination between the proposed two-branch stacked LSTM and the 3D-CNN-based network can manage missing skeleton information, greatly improving the overall accuracy. Moreover, additional tests on KTH and UCF Sports datasets also show the robustness of our approach in the presence of partial body occlusions. Finally, comparisons on UT-Kinect and NTU-RGB+D datasets show that the accuracy of the proposed method is fully comparable to that of works based on 3D skeletons.
doi_str_mv 10.1109/TMM.2019.2960588
format Article
fullrecord <record><control><sourceid>proquest_RIE</sourceid><recordid>TN_cdi_proquest_journals_2446058559</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><ieee_id>8936339</ieee_id><sourcerecordid>2446058559</sourcerecordid><originalsourceid>FETCH-LOGICAL-c291t-d81466c4e7526418c90b94ec0ae4fc87f514553e2747a70b0ce40b6ceda2e74f3</originalsourceid><addsrcrecordid>eNo9kE1PAjEQhhujiYjeTbxs4rk47bbb7RFQ1AQwgfXclDKrC7jF7aLx31uEeJo5PO98PIRcM-gxBvqumEx6HJjucZ2BzPMT0mFaMAqg1GnsJQeqOYNzchHCCoAJCapDRpzeJ_M1brD1NR3YgMuk79rK18kMnX-rq7_-q7JJ8e3poLG1e0_mrXXrSI7nxYTOptNwSc5Kuwl4daxd8jp6KIZPdPzy-Dzsj6njmrV0mTORZU6gkjwTLHcaFlqgA4uidLkqZTxLpsiVUFbBAhwKWGQOl5ajEmXaJbeHudvGf-4wtGbld00dVxouxP5xKXWk4EC5xofQYGm2TfVhmx_DwOxtmWjL7G2Zo60YuTlEKkT8x3OdZmmq0194RWNL</addsrcrecordid><sourcetype>Aggregation Database</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>2446058559</pqid></control><display><type>article</type><title>2-D Skeleton-Based Action Recognition via Two-Branch Stacked LSTM-RNNs</title><source>IEEE/IET Electronic Library</source><creator>Avola, Danilo ; Cascio, Marco ; Cinque, Luigi ; Foresti, Gian Luca ; Massaroni, Cristiano ; Rodola, Emanuele</creator><creatorcontrib>Avola, Danilo ; Cascio, Marco ; Cinque, Luigi ; Foresti, Gian Luca ; Massaroni, Cristiano ; Rodola, Emanuele</creatorcontrib><description>Action recognition in video sequences is an interesting field for many computer vision applications, including behavior analysis, event recognition, and video surveillance. In this article, a method based on 2D skeleton and two-branch stacked Recurrent Neural Networks (RNNs) with Long Short-Term Memory (LSTM) cells is proposed. Unlike 3D skeletons, usually generated by RGB-D cameras, the 2D skeletons adopted in this article are reconstructed starting from RGB video streams, therefore allowing the use of the proposed approach in both indoor and outdoor environments. 
Moreover, any case of missing skeletal data is managed by exploiting 3D-Convolutional Neural Networks (3D-CNNs). Comparative experiments with several key works on KTH and Weizmann datasets show that the method described in this paper outperforms the current state-of-the-art. Additional experiments on UCF Sports and IXMAS datasets demonstrate the effectiveness of our method in the presence of noisy data and perspective changes, respectively. Further investigations on UCF Sports, HMDB51, UCF101, and Kinetics400 highlight how the combination between the proposed two-branch stacked LSTM and the 3D-CNN-based network can manage missing skeleton information, greatly improving the overall accuracy. Moreover, additional tests on KTH and UCF Sports datasets also show the robustness of our approach in the presence of partial body occlusions. Finally, comparisons on UT-Kinect and NTU-RGB+D datasets show that the accuracy of the proposed method is fully comparable to that of works based on 3D skeletons.</description><identifier>ISSN: 1520-9210</identifier><identifier>EISSN: 1941-0077</identifier><identifier>DOI: 10.1109/TMM.2019.2960588</identifier><identifier>CODEN: ITMUF8</identifier><language>eng</language><publisher>Piscataway: IEEE</publisher><subject>2D skeleton ; Action recognition ; Artificial neural networks ; Cameras ; Computer vision ; Datasets ; Feature extraction ; Indoor environments ; long short-term memory (LSTM) ; Neural networks ; Recognition ; Recurrent neural networks ; recurrent neural networks (RNNs) ; Skeleton ; Sports ; Streaming media ; Three-dimensional displays ; Two dimensional displays ; Video data</subject><ispartof>IEEE transactions on multimedia, 2020-10, Vol.22 (10), p.2481-2496</ispartof><rights>Copyright The Institute of Electrical and Electronics Engineers, Inc. 
(IEEE) 2020</rights><lds50>peer_reviewed</lds50><woscitedreferencessubscribed>false</woscitedreferencessubscribed><citedby>FETCH-LOGICAL-c291t-d81466c4e7526418c90b94ec0ae4fc87f514553e2747a70b0ce40b6ceda2e74f3</citedby><cites>FETCH-LOGICAL-c291t-d81466c4e7526418c90b94ec0ae4fc87f514553e2747a70b0ce40b6ceda2e74f3</cites><orcidid>0000-0001-9437-6217 ; 0000-0002-8425-6892 ; 0000-0002-6942-4851 ; 0000-0003-0091-7241</orcidid></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><linktohtml>$$Uhttps://ieeexplore.ieee.org/document/8936339$$EHTML$$P50$$Gieee$$H</linktohtml><link.rule.ids>314,780,784,796,27923,27924,54757</link.rule.ids><linktorsrc>$$Uhttps://ieeexplore.ieee.org/document/8936339$$EView_record_in_IEEE$$FView_record_in_$$GIEEE</linktorsrc></links><search><creatorcontrib>Avola, Danilo</creatorcontrib><creatorcontrib>Cascio, Marco</creatorcontrib><creatorcontrib>Cinque, Luigi</creatorcontrib><creatorcontrib>Foresti, Gian Luca</creatorcontrib><creatorcontrib>Massaroni, Cristiano</creatorcontrib><creatorcontrib>Rodola, Emanuele</creatorcontrib><title>2-D Skeleton-Based Action Recognition via Two-Branch Stacked LSTM-RNNs</title><title>IEEE transactions on multimedia</title><addtitle>TMM</addtitle><description>Action recognition in video sequences is an interesting field for many computer vision applications, including behavior analysis, event recognition, and video surveillance. In this article, a method based on 2D skeleton and two-branch stacked Recurrent Neural Networks (RNNs) with Long Short-Term Memory (LSTM) cells is proposed. Unlike 3D skeletons, usually generated by RGB-D cameras, the 2D skeletons adopted in this article are reconstructed starting from RGB video streams, therefore allowing the use of the proposed approach in both indoor and outdoor environments. 
Moreover, any case of missing skeletal data is managed by exploiting 3D-Convolutional Neural Networks (3D-CNNs). Comparative experiments with several key works on KTH and Weizmann datasets show that the method described in this paper outperforms the current state-of-the-art. Additional experiments on UCF Sports and IXMAS datasets demonstrate the effectiveness of our method in the presence of noisy data and perspective changes, respectively. Further investigations on UCF Sports, HMDB51, UCF101, and Kinetics400 highlight how the combination between the proposed two-branch stacked LSTM and the 3D-CNN-based network can manage missing skeleton information, greatly improving the overall accuracy. Moreover, additional tests on KTH and UCF Sports datasets also show the robustness of our approach in the presence of partial body occlusions. Finally, comparisons on UT-Kinect and NTU-RGB+D datasets show that the accuracy of the proposed method is fully comparable to that of works based on 3D skeletons.</description><subject>2D skeleton</subject><subject>Action recognition</subject><subject>Artificial neural networks</subject><subject>Cameras</subject><subject>Computer vision</subject><subject>Datasets</subject><subject>Feature extraction</subject><subject>Indoor environments</subject><subject>long short-term memory (LSTM)</subject><subject>Neural networks</subject><subject>Recognition</subject><subject>Recurrent neural networks</subject><subject>recurrent neural networks (RNNs)</subject><subject>Skeleton</subject><subject>Sports</subject><subject>Streaming media</subject><subject>Three-dimensional displays</subject><subject>Two dimensional displays</subject><subject>Video 
data</subject><issn>1520-9210</issn><issn>1941-0077</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2020</creationdate><recordtype>article</recordtype><sourceid>RIE</sourceid><recordid>eNo9kE1PAjEQhhujiYjeTbxs4rk47bbb7RFQ1AQwgfXclDKrC7jF7aLx31uEeJo5PO98PIRcM-gxBvqumEx6HJjucZ2BzPMT0mFaMAqg1GnsJQeqOYNzchHCCoAJCapDRpzeJ_M1brD1NR3YgMuk79rK18kMnX-rq7_-q7JJ8e3poLG1e0_mrXXrSI7nxYTOptNwSc5Kuwl4daxd8jp6KIZPdPzy-Dzsj6njmrV0mTORZU6gkjwTLHcaFlqgA4uidLkqZTxLpsiVUFbBAhwKWGQOl5ajEmXaJbeHudvGf-4wtGbld00dVxouxP5xKXWk4EC5xofQYGm2TfVhmx_DwOxtmWjL7G2Zo60YuTlEKkT8x3OdZmmq0194RWNL</recordid><startdate>20201001</startdate><enddate>20201001</enddate><creator>Avola, Danilo</creator><creator>Cascio, Marco</creator><creator>Cinque, Luigi</creator><creator>Foresti, Gian Luca</creator><creator>Massaroni, Cristiano</creator><creator>Rodola, Emanuele</creator><general>IEEE</general><general>The Institute of Electrical and Electronics Engineers, Inc. (IEEE)</general><scope>97E</scope><scope>RIA</scope><scope>RIE</scope><scope>AAYXX</scope><scope>CITATION</scope><scope>7SC</scope><scope>7SP</scope><scope>8FD</scope><scope>JQ2</scope><scope>L7M</scope><scope>L~C</scope><scope>L~D</scope><orcidid>https://orcid.org/0000-0001-9437-6217</orcidid><orcidid>https://orcid.org/0000-0002-8425-6892</orcidid><orcidid>https://orcid.org/0000-0002-6942-4851</orcidid><orcidid>https://orcid.org/0000-0003-0091-7241</orcidid></search><sort><creationdate>20201001</creationdate><title>2-D Skeleton-Based Action Recognition via Two-Branch Stacked LSTM-RNNs</title><author>Avola, Danilo ; Cascio, Marco ; Cinque, Luigi ; Foresti, Gian Luca ; Massaroni, Cristiano ; Rodola, Emanuele</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-c291t-d81466c4e7526418c90b94ec0ae4fc87f514553e2747a70b0ce40b6ceda2e74f3</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2020</creationdate><topic>2D skeleton</topic><topic>Action 
recognition</topic><topic>Artificial neural networks</topic><topic>Cameras</topic><topic>Computer vision</topic><topic>Datasets</topic><topic>Feature extraction</topic><topic>Indoor environments</topic><topic>long short-term memory (LSTM)</topic><topic>Neural networks</topic><topic>Recognition</topic><topic>Recurrent neural networks</topic><topic>recurrent neural networks (RNNs)</topic><topic>Skeleton</topic><topic>Sports</topic><topic>Streaming media</topic><topic>Three-dimensional displays</topic><topic>Two dimensional displays</topic><topic>Video data</topic><toplevel>peer_reviewed</toplevel><toplevel>online_resources</toplevel><creatorcontrib>Avola, Danilo</creatorcontrib><creatorcontrib>Cascio, Marco</creatorcontrib><creatorcontrib>Cinque, Luigi</creatorcontrib><creatorcontrib>Foresti, Gian Luca</creatorcontrib><creatorcontrib>Massaroni, Cristiano</creatorcontrib><creatorcontrib>Rodola, Emanuele</creatorcontrib><collection>IEEE All-Society Periodicals Package (ASPP) 2005-present</collection><collection>IEEE All-Society Periodicals Package (ASPP) 1998-Present</collection><collection>IEEE/IET Electronic Library</collection><collection>CrossRef</collection><collection>Computer and Information Systems Abstracts</collection><collection>Electronics &amp; Communications Abstracts</collection><collection>Technology Research Database</collection><collection>ProQuest Computer Science Collection</collection><collection>Advanced Technologies Database with Aerospace</collection><collection>Computer and Information Systems Abstracts – Academic</collection><collection>Computer and Information Systems Abstracts Professional</collection><jtitle>IEEE transactions on multimedia</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext_linktorsrc</fulltext></delivery><addata><au>Avola, Danilo</au><au>Cascio, Marco</au><au>Cinque, Luigi</au><au>Foresti, Gian Luca</au><au>Massaroni, Cristiano</au><au>Rodola, 
Emanuele</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>2-D Skeleton-Based Action Recognition via Two-Branch Stacked LSTM-RNNs</atitle><jtitle>IEEE transactions on multimedia</jtitle><stitle>TMM</stitle><date>2020-10-01</date><risdate>2020</risdate><volume>22</volume><issue>10</issue><spage>2481</spage><epage>2496</epage><pages>2481-2496</pages><issn>1520-9210</issn><eissn>1941-0077</eissn><coden>ITMUF8</coden><abstract>Action recognition in video sequences is an interesting field for many computer vision applications, including behavior analysis, event recognition, and video surveillance. In this article, a method based on 2D skeleton and two-branch stacked Recurrent Neural Networks (RNNs) with Long Short-Term Memory (LSTM) cells is proposed. Unlike 3D skeletons, usually generated by RGB-D cameras, the 2D skeletons adopted in this article are reconstructed starting from RGB video streams, therefore allowing the use of the proposed approach in both indoor and outdoor environments. Moreover, any case of missing skeletal data is managed by exploiting 3D-Convolutional Neural Networks (3D-CNNs). Comparative experiments with several key works on KTH and Weizmann datasets show that the method described in this paper outperforms the current state-of-the-art. Additional experiments on UCF Sports and IXMAS datasets demonstrate the effectiveness of our method in the presence of noisy data and perspective changes, respectively. Further investigations on UCF Sports, HMDB51, UCF101, and Kinetics400 highlight how the combination between the proposed two-branch stacked LSTM and the 3D-CNN-based network can manage missing skeleton information, greatly improving the overall accuracy. Moreover, additional tests on KTH and UCF Sports datasets also show the robustness of our approach in the presence of partial body occlusions. 
Finally, comparisons on UT-Kinect and NTU-RGB+D datasets show that the accuracy of the proposed method is fully comparable to that of works based on 3D skeletons.</abstract><cop>Piscataway</cop><pub>IEEE</pub><doi>10.1109/TMM.2019.2960588</doi><tpages>16</tpages><orcidid>https://orcid.org/0000-0001-9437-6217</orcidid><orcidid>https://orcid.org/0000-0002-8425-6892</orcidid><orcidid>https://orcid.org/0000-0002-6942-4851</orcidid><orcidid>https://orcid.org/0000-0003-0091-7241</orcidid></addata></record>
fulltext fulltext_linktorsrc
identifier ISSN: 1520-9210
ispartof IEEE transactions on multimedia, 2020-10, Vol.22 (10), p.2481-2496
issn 1520-9210
1941-0077
language eng
recordid cdi_proquest_journals_2446058559
source IEEE/IET Electronic Library
subjects 2D skeleton
Action recognition
Artificial neural networks
Cameras
Computer vision
Datasets
Feature extraction
Indoor environments
long short-term memory (LSTM)
Neural networks
Recognition
Recurrent neural networks
recurrent neural networks (RNNs)
Skeleton
Sports
Streaming media
Three-dimensional displays
Two dimensional displays
Video data
title 2-D Skeleton-Based Action Recognition via Two-Branch Stacked LSTM-RNNs
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-10T14%3A55%3A19IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_RIE&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=2-D%20Skeleton-Based%20Action%20Recognition%20via%20Two-Branch%20Stacked%20LSTM-RNNs&rft.jtitle=IEEE%20transactions%20on%20multimedia&rft.au=Avola,%20Danilo&rft.date=2020-10-01&rft.volume=22&rft.issue=10&rft.spage=2481&rft.epage=2496&rft.pages=2481-2496&rft.issn=1520-9210&rft.eissn=1941-0077&rft.coden=ITMUF8&rft_id=info:doi/10.1109/TMM.2019.2960588&rft_dat=%3Cproquest_RIE%3E2446058559%3C/proquest_RIE%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2446058559&rft_id=info:pmid/&rft_ieee_id=8936339&rfr_iscdi=true
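The url field above is an OpenURL-style resolver link whose query string repeats the citation metadata of this record (rft.atitle, rft.au, rft.volume, and so on). As a minimal sketch of how such a link could be decoded back into fields, the Python below uses only the standard library. The link is a shortened copy that keeps a subset of the parameters from the record, and the helper name parse_openurl is hypothetical, introduced only for this illustration.

```python
from urllib.parse import urlsplit, parse_qs

# Shortened copy of the record's resolver link: only a subset of the
# original query parameters is kept here for readability.
url = (
    "https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004"
    "&rft.genre=article"
    "&rft.atitle=2-D%20Skeleton-Based%20Action%20Recognition%20via%20Two-Branch%20Stacked%20LSTM-RNNs"
    "&rft.jtitle=IEEE%20transactions%20on%20multimedia"
    "&rft.au=Avola,%20Danilo"
    "&rft.date=2020-10-01&rft.volume=22&rft.issue=10"
    "&rft.spage=2481&rft.epage=2496"
    "&rft.issn=1520-9210&rft.eissn=1941-0077"
    "&rft_id=info:doi/10.1109/TMM.2019.2960588"
)


def parse_openurl(link):
    """Hypothetical helper: return the query parameters of a resolver link as a dict.

    Values are percent-decoded. A key that occurs once maps to a string,
    a key that occurs several times maps to a list of strings.
    """
    params = parse_qs(urlsplit(link).query)
    return {
        key: values[0] if len(values) == 1 else values
        for key, values in params.items()
    }


fields = parse_openurl(url)
for key in ("rft.atitle", "rft.au", "rft.volume", "rft.issue", "rft.spage", "rft.epage"):
    print(key, "=", fields[key])
```

In the full link rft_id occurs more than once, so the helper would return a list for that key; in the shortened copy it occurs once and comes back as a plain string.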