2-D Skeleton-Based Action Recognition via Two-Branch Stacked LSTM-RNNs

Action recognition in video sequences is an interesting field for many computer vision applications, including behavior analysis, event recognition, and video surveillance. In this article, a method based on 2D skeleton and two-branch stacked Recurrent Neural Networks (RNNs) with Long Short-Term Mem...

Bibliographic Details
Published in:	IEEE transactions on multimedia, 2020-10, Vol.22 (10), p.2481-2496
Main authors:	Avola, Danilo; Cascio, Marco; Cinque, Luigi; Foresti, Gian Luca; Massaroni, Cristiano; Rodola, Emanuele
Format:	Article
Language:	English
Subjects:	2D skeleton; Action recognition; Artificial neural networks; Cameras; Computer vision; Datasets; Feature extraction; Indoor environments; long short-term memory (LSTM); Neural networks; Recognition; Recurrent neural networks; recurrent neural networks (RNNs); Skeleton; Sports; Streaming media; Three-dimensional displays; Two dimensional displays; Video data

container_end_page	2496
container_issue	10
container_start_page	2481
container_title	IEEE transactions on multimedia
container_volume	22
creator	Avola, Danilo; Cascio, Marco; Cinque, Luigi; Foresti, Gian Luca; Massaroni, Cristiano; Rodola, Emanuele
description	Action recognition in video sequences is an interesting field for many computer vision applications, including behavior analysis, event recognition, and video surveillance. In this article, a method based on 2D skeletons and two-branch stacked Recurrent Neural Networks (RNNs) with Long Short-Term Memory (LSTM) cells is proposed. Unlike 3D skeletons, which are usually generated by RGB-D cameras, the 2D skeletons adopted in this article are reconstructed from RGB video streams, which allows the proposed approach to be used in both indoor and outdoor environments. Moreover, any case of missing skeletal data is managed by exploiting 3D-Convolutional Neural Networks (3D-CNNs). Comparative experiments with several key works on the KTH and Weizmann datasets show that the method described in this article outperforms the current state of the art. Additional experiments on the UCF Sports and IXMAS datasets demonstrate the effectiveness of the method in the presence of noisy data and perspective changes, respectively. Further investigations on UCF Sports, HMDB51, UCF101, and Kinetics400 highlight how the combination of the proposed two-branch stacked LSTM and the 3D-CNN-based network can manage missing skeleton information, greatly improving the overall accuracy. Additional tests on the KTH and UCF Sports datasets also show the robustness of the approach in the presence of partial body occlusions. Finally, comparisons on the UT-Kinect and NTU-RGB+D datasets show that the accuracy of the proposed method is fully comparable to that of works based on 3D skeletons.
doi_str_mv	10.1109/TMM.2019.2960588
format	Article
fulltext	fulltext_linktorsrc
identifier	ISSN: 1520-9210
ispartof	IEEE transactions on multimedia, 2020-10, Vol.22 (10), p.2481-2496
issn	1520-9210 1941-0077
language	eng
recordid	cdi_proquest_journals_2446058559
source	IEEE/IET Electronic Library
subjects	2D skeleton; Action recognition; Artificial neural networks; Cameras; Computer vision; Datasets; Feature extraction; Indoor environments; long short-term memory (LSTM); Neural networks; Recognition; Recurrent neural networks; recurrent neural networks (RNNs); Skeleton; Sports; Streaming media; Three-dimensional displays; Two dimensional displays; Video data
title	2-D Skeleton-Based Action Recognition via Two-Branch Stacked LSTM-RNNs
url	https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-10T14%3A55%3A19IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_RIE&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=2-D%20Skeleton-Based%20Action%20Recognition%20via%20Two-Branch%20Stacked%20LSTM-RNNs&rft.jtitle=IEEE%20transactions%20on%20multimedia&rft.au=Avola,%20Danilo&rft.date=2020-10-01&rft.volume=22&rft.issue=10&rft.spage=2481&rft.epage=2496&rft.pages=2481-2496&rft.issn=1520-9210&rft.eissn=1941-0077&rft.coden=ITMUF8&rft_id=info:doi/10.1109/TMM.2019.2960588&rft_dat=%3Cproquest_RIE%3E2446058559%3C/proquest_RIE%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2446058559&rft_id=info:pmid/&rft_ieee_id=8936339&rfr_iscdi=true