2-D Skeleton-Based Action Recognition via Two-Branch Stacked LSTM-RNNs
Action recognition in video sequences is an interesting field for many computer vision applications, including behavior analysis, event recognition, and video surveillance. In this article, a method based on 2D skeleton and two-branch stacked Recurrent Neural Networks (RNNs) with Long Short-Term Mem...
Published in: | IEEE transactions on multimedia 2020-10, Vol.22 (10), p.2481-2496 |
---|---|
Main authors: | Avola, Danilo ; Cascio, Marco ; Cinque, Luigi ; Foresti, Gian Luca ; Massaroni, Cristiano ; Rodola, Emanuele |
Format: | Article |
Language: | eng |
Subjects: | |
Online access: | Order full text |
container_end_page | 2496 |
---|---|
container_issue | 10 |
container_start_page | 2481 |
container_title | IEEE transactions on multimedia |
container_volume | 22 |
creator | Avola, Danilo ; Cascio, Marco ; Cinque, Luigi ; Foresti, Gian Luca ; Massaroni, Cristiano ; Rodola, Emanuele |
description | Action recognition in video sequences is an interesting field for many computer vision applications, including behavior analysis, event recognition, and video surveillance. In this article, a method based on 2D skeleton and two-branch stacked Recurrent Neural Networks (RNNs) with Long Short-Term Memory (LSTM) cells is proposed. Unlike 3D skeletons, usually generated by RGB-D cameras, the 2D skeletons adopted in this article are reconstructed starting from RGB video streams, therefore allowing the use of the proposed approach in both indoor and outdoor environments. Moreover, any case of missing skeletal data is managed by exploiting 3D-Convolutional Neural Networks (3D-CNNs). Comparative experiments with several key works on KTH and Weizmann datasets show that the method described in this paper outperforms the current state-of-the-art. Additional experiments on UCF Sports and IXMAS datasets demonstrate the effectiveness of our method in the presence of noisy data and perspective changes, respectively. Further investigations on UCF Sports, HMDB51, UCF101, and Kinetics400 highlight how the combination between the proposed two-branch stacked LSTM and the 3D-CNN-based network can manage missing skeleton information, greatly improving the overall accuracy. Moreover, additional tests on KTH and UCF Sports datasets also show the robustness of our approach in the presence of partial body occlusions. Finally, comparisons on UT-Kinect and NTU-RGB+D datasets show that the accuracy of the proposed method is fully comparable to that of works based on 3D skeletons. |
doi_str_mv | 10.1109/TMM.2019.2960588 |
format | Article |
fullrecord | <record><control><sourceid>proquest_RIE</sourceid><recordid>TN_cdi_proquest_journals_2446058559</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><ieee_id>8936339</ieee_id><sourcerecordid>2446058559</sourcerecordid><originalsourceid>FETCH-LOGICAL-c291t-d81466c4e7526418c90b94ec0ae4fc87f514553e2747a70b0ce40b6ceda2e74f3</originalsourceid><addsrcrecordid>eNo9kE1PAjEQhhujiYjeTbxs4rk47bbb7RFQ1AQwgfXclDKrC7jF7aLx31uEeJo5PO98PIRcM-gxBvqumEx6HJjucZ2BzPMT0mFaMAqg1GnsJQeqOYNzchHCCoAJCapDRpzeJ_M1brD1NR3YgMuk79rK18kMnX-rq7_-q7JJ8e3poLG1e0_mrXXrSI7nxYTOptNwSc5Kuwl4daxd8jp6KIZPdPzy-Dzsj6njmrV0mTORZU6gkjwTLHcaFlqgA4uidLkqZTxLpsiVUFbBAhwKWGQOl5ajEmXaJbeHudvGf-4wtGbld00dVxouxP5xKXWk4EC5xofQYGm2TfVhmx_DwOxtmWjL7G2Zo60YuTlEKkT8x3OdZmmq0194RWNL</addsrcrecordid><sourcetype>Aggregation Database</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>2446058559</pqid></control><display><type>article</type><title>2-D Skeleton-Based Action Recognition via Two-Branch Stacked LSTM-RNNs</title><source>IEEE/IET Electronic Library</source><creator>Avola, Danilo ; Cascio, Marco ; Cinque, Luigi ; Foresti, Gian Luca ; Massaroni, Cristiano ; Rodola, Emanuele</creator><creatorcontrib>Avola, Danilo ; Cascio, Marco ; Cinque, Luigi ; Foresti, Gian Luca ; Massaroni, Cristiano ; Rodola, Emanuele</creatorcontrib><description>Action recognition in video sequences is an interesting field for many computer vision applications, including behavior analysis, event recognition, and video surveillance. In this article, a method based on 2D skeleton and two-branch stacked Recurrent Neural Networks (RNNs) with Long Short-Term Memory (LSTM) cells is proposed. Unlike 3D skeletons, usually generated by RGB-D cameras, the 2D skeletons adopted in this article are reconstructed starting from RGB video streams, therefore allowing the use of the proposed approach in both indoor and outdoor environments. 
Moreover, any case of missing skeletal data is managed by exploiting 3D-Convolutional Neural Networks (3D-CNNs). Comparative experiments with several key works on KTH and Weizmann datasets show that the method described in this paper outperforms the current state-of-the-art. Additional experiments on UCF Sports and IXMAS datasets demonstrate the effectiveness of our method in the presence of noisy data and perspective changes, respectively. Further investigations on UCF Sports, HMDB51, UCF101, and Kinetics400 highlight how the combination between the proposed two-branch stacked LSTM and the 3D-CNN-based network can manage missing skeleton information, greatly improving the overall accuracy. Moreover, additional tests on KTH and UCF Sports datasets also show the robustness of our approach in the presence of partial body occlusions. Finally, comparisons on UT-Kinect and NTU-RGB+D datasets show that the accuracy of the proposed method is fully comparable to that of works based on 3D skeletons.</description><identifier>ISSN: 1520-9210</identifier><identifier>EISSN: 1941-0077</identifier><identifier>DOI: 10.1109/TMM.2019.2960588</identifier><identifier>CODEN: ITMUF8</identifier><language>eng</language><publisher>Piscataway: IEEE</publisher><subject>2D skeleton ; Action recognition ; Artificial neural networks ; Cameras ; Computer vision ; Datasets ; Feature extraction ; Indoor environments ; long short-term memory (LSTM) ; Neural networks ; Recognition ; Recurrent neural networks ; recurrent neural networks (RNNs) ; Skeleton ; Sports ; Streaming media ; Three-dimensional displays ; Two dimensional displays ; Video data</subject><ispartof>IEEE transactions on multimedia, 2020-10, Vol.22 (10), p.2481-2496</ispartof><rights>Copyright The Institute of Electrical and Electronics Engineers, Inc. 
(IEEE) 2020</rights><lds50>peer_reviewed</lds50><woscitedreferencessubscribed>false</woscitedreferencessubscribed><citedby>FETCH-LOGICAL-c291t-d81466c4e7526418c90b94ec0ae4fc87f514553e2747a70b0ce40b6ceda2e74f3</citedby><cites>FETCH-LOGICAL-c291t-d81466c4e7526418c90b94ec0ae4fc87f514553e2747a70b0ce40b6ceda2e74f3</cites><orcidid>0000-0001-9437-6217 ; 0000-0002-8425-6892 ; 0000-0002-6942-4851 ; 0000-0003-0091-7241</orcidid></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><linktohtml>$$Uhttps://ieeexplore.ieee.org/document/8936339$$EHTML$$P50$$Gieee$$H</linktohtml><link.rule.ids>314,780,784,796,27923,27924,54757</link.rule.ids><linktorsrc>$$Uhttps://ieeexplore.ieee.org/document/8936339$$EView_record_in_IEEE$$FView_record_in_$$GIEEE</linktorsrc></links><search><creatorcontrib>Avola, Danilo</creatorcontrib><creatorcontrib>Cascio, Marco</creatorcontrib><creatorcontrib>Cinque, Luigi</creatorcontrib><creatorcontrib>Foresti, Gian Luca</creatorcontrib><creatorcontrib>Massaroni, Cristiano</creatorcontrib><creatorcontrib>Rodola, Emanuele</creatorcontrib><title>2-D Skeleton-Based Action Recognition via Two-Branch Stacked LSTM-RNNs</title><title>IEEE transactions on multimedia</title><addtitle>TMM</addtitle><description>Action recognition in video sequences is an interesting field for many computer vision applications, including behavior analysis, event recognition, and video surveillance. In this article, a method based on 2D skeleton and two-branch stacked Recurrent Neural Networks (RNNs) with Long Short-Term Memory (LSTM) cells is proposed. Unlike 3D skeletons, usually generated by RGB-D cameras, the 2D skeletons adopted in this article are reconstructed starting from RGB video streams, therefore allowing the use of the proposed approach in both indoor and outdoor environments. 
Moreover, any case of missing skeletal data is managed by exploiting 3D-Convolutional Neural Networks (3D-CNNs). Comparative experiments with several key works on KTH and Weizmann datasets show that the method described in this paper outperforms the current state-of-the-art. Additional experiments on UCF Sports and IXMAS datasets demonstrate the effectiveness of our method in the presence of noisy data and perspective changes, respectively. Further investigations on UCF Sports, HMDB51, UCF101, and Kinetics400 highlight how the combination between the proposed two-branch stacked LSTM and the 3D-CNN-based network can manage missing skeleton information, greatly improving the overall accuracy. Moreover, additional tests on KTH and UCF Sports datasets also show the robustness of our approach in the presence of partial body occlusions. Finally, comparisons on UT-Kinect and NTU-RGB+D datasets show that the accuracy of the proposed method is fully comparable to that of works based on 3D skeletons.</description><subject>2D skeleton</subject><subject>Action recognition</subject><subject>Artificial neural networks</subject><subject>Cameras</subject><subject>Computer vision</subject><subject>Datasets</subject><subject>Feature extraction</subject><subject>Indoor environments</subject><subject>long short-term memory (LSTM)</subject><subject>Neural networks</subject><subject>Recognition</subject><subject>Recurrent neural networks</subject><subject>recurrent neural networks (RNNs)</subject><subject>Skeleton</subject><subject>Sports</subject><subject>Streaming media</subject><subject>Three-dimensional displays</subject><subject>Two dimensional displays</subject><subject>Video 
data</subject><issn>1520-9210</issn><issn>1941-0077</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2020</creationdate><recordtype>article</recordtype><sourceid>RIE</sourceid><recordid>eNo9kE1PAjEQhhujiYjeTbxs4rk47bbb7RFQ1AQwgfXclDKrC7jF7aLx31uEeJo5PO98PIRcM-gxBvqumEx6HJjucZ2BzPMT0mFaMAqg1GnsJQeqOYNzchHCCoAJCapDRpzeJ_M1brD1NR3YgMuk79rK18kMnX-rq7_-q7JJ8e3poLG1e0_mrXXrSI7nxYTOptNwSc5Kuwl4daxd8jp6KIZPdPzy-Dzsj6njmrV0mTORZU6gkjwTLHcaFlqgA4uidLkqZTxLpsiVUFbBAhwKWGQOl5ajEmXaJbeHudvGf-4wtGbld00dVxouxP5xKXWk4EC5xofQYGm2TfVhmx_DwOxtmWjL7G2Zo60YuTlEKkT8x3OdZmmq0194RWNL</recordid><startdate>20201001</startdate><enddate>20201001</enddate><creator>Avola, Danilo</creator><creator>Cascio, Marco</creator><creator>Cinque, Luigi</creator><creator>Foresti, Gian Luca</creator><creator>Massaroni, Cristiano</creator><creator>Rodola, Emanuele</creator><general>IEEE</general><general>The Institute of Electrical and Electronics Engineers, Inc. (IEEE)</general><scope>97E</scope><scope>RIA</scope><scope>RIE</scope><scope>AAYXX</scope><scope>CITATION</scope><scope>7SC</scope><scope>7SP</scope><scope>8FD</scope><scope>JQ2</scope><scope>L7M</scope><scope>L~C</scope><scope>L~D</scope><orcidid>https://orcid.org/0000-0001-9437-6217</orcidid><orcidid>https://orcid.org/0000-0002-8425-6892</orcidid><orcidid>https://orcid.org/0000-0002-6942-4851</orcidid><orcidid>https://orcid.org/0000-0003-0091-7241</orcidid></search><sort><creationdate>20201001</creationdate><title>2-D Skeleton-Based Action Recognition via Two-Branch Stacked LSTM-RNNs</title><author>Avola, Danilo ; Cascio, Marco ; Cinque, Luigi ; Foresti, Gian Luca ; Massaroni, Cristiano ; Rodola, Emanuele</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-c291t-d81466c4e7526418c90b94ec0ae4fc87f514553e2747a70b0ce40b6ceda2e74f3</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2020</creationdate><topic>2D skeleton</topic><topic>Action 
recognition</topic><topic>Artificial neural networks</topic><topic>Cameras</topic><topic>Computer vision</topic><topic>Datasets</topic><topic>Feature extraction</topic><topic>Indoor environments</topic><topic>long short-term memory (LSTM)</topic><topic>Neural networks</topic><topic>Recognition</topic><topic>Recurrent neural networks</topic><topic>recurrent neural networks (RNNs)</topic><topic>Skeleton</topic><topic>Sports</topic><topic>Streaming media</topic><topic>Three-dimensional displays</topic><topic>Two dimensional displays</topic><topic>Video data</topic><toplevel>peer_reviewed</toplevel><toplevel>online_resources</toplevel><creatorcontrib>Avola, Danilo</creatorcontrib><creatorcontrib>Cascio, Marco</creatorcontrib><creatorcontrib>Cinque, Luigi</creatorcontrib><creatorcontrib>Foresti, Gian Luca</creatorcontrib><creatorcontrib>Massaroni, Cristiano</creatorcontrib><creatorcontrib>Rodola, Emanuele</creatorcontrib><collection>IEEE All-Society Periodicals Package (ASPP) 2005-present</collection><collection>IEEE All-Society Periodicals Package (ASPP) 1998-Present</collection><collection>IEEE/IET Electronic Library</collection><collection>CrossRef</collection><collection>Computer and Information Systems Abstracts</collection><collection>Electronics & Communications Abstracts</collection><collection>Technology Research Database</collection><collection>ProQuest Computer Science Collection</collection><collection>Advanced Technologies Database with Aerospace</collection><collection>Computer and Information Systems Abstracts Academic</collection><collection>Computer and Information Systems Abstracts Professional</collection><jtitle>IEEE transactions on multimedia</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext_linktorsrc</fulltext></delivery><addata><au>Avola, Danilo</au><au>Cascio, Marco</au><au>Cinque, Luigi</au><au>Foresti, Gian Luca</au><au>Massaroni, Cristiano</au><au>Rodola, 
Emanuele</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>2-D Skeleton-Based Action Recognition via Two-Branch Stacked LSTM-RNNs</atitle><jtitle>IEEE transactions on multimedia</jtitle><stitle>TMM</stitle><date>2020-10-01</date><risdate>2020</risdate><volume>22</volume><issue>10</issue><spage>2481</spage><epage>2496</epage><pages>2481-2496</pages><issn>1520-9210</issn><eissn>1941-0077</eissn><coden>ITMUF8</coden><abstract>Action recognition in video sequences is an interesting field for many computer vision applications, including behavior analysis, event recognition, and video surveillance. In this article, a method based on 2D skeleton and two-branch stacked Recurrent Neural Networks (RNNs) with Long Short-Term Memory (LSTM) cells is proposed. Unlike 3D skeletons, usually generated by RGB-D cameras, the 2D skeletons adopted in this article are reconstructed starting from RGB video streams, therefore allowing the use of the proposed approach in both indoor and outdoor environments. Moreover, any case of missing skeletal data is managed by exploiting 3D-Convolutional Neural Networks (3D-CNNs). Comparative experiments with several key works on KTH and Weizmann datasets show that the method described in this paper outperforms the current state-of-the-art. Additional experiments on UCF Sports and IXMAS datasets demonstrate the effectiveness of our method in the presence of noisy data and perspective changes, respectively. Further investigations on UCF Sports, HMDB51, UCF101, and Kinetics400 highlight how the combination between the proposed two-branch stacked LSTM and the 3D-CNN-based network can manage missing skeleton information, greatly improving the overall accuracy. Moreover, additional tests on KTH and UCF Sports datasets also show the robustness of our approach in the presence of partial body occlusions. 
Finally, comparisons on UT-Kinect and NTU-RGB+D datasets show that the accuracy of the proposed method is fully comparable to that of works based on 3D skeletons.</abstract><cop>Piscataway</cop><pub>IEEE</pub><doi>10.1109/TMM.2019.2960588</doi><tpages>16</tpages><orcidid>https://orcid.org/0000-0001-9437-6217</orcidid><orcidid>https://orcid.org/0000-0002-8425-6892</orcidid><orcidid>https://orcid.org/0000-0002-6942-4851</orcidid><orcidid>https://orcid.org/0000-0003-0091-7241</orcidid></addata></record> |
fulltext | fulltext_linktorsrc |
identifier | ISSN: 1520-9210 |
ispartof | IEEE transactions on multimedia, 2020-10, Vol.22 (10), p.2481-2496 |
issn | 1520-9210 ; 1941-0077 |
language | eng |
recordid | cdi_proquest_journals_2446058559 |
source | IEEE/IET Electronic Library |
subjects | 2D skeleton ; Action recognition ; Artificial neural networks ; Cameras ; Computer vision ; Datasets ; Feature extraction ; Indoor environments ; long short-term memory (LSTM) ; Neural networks ; Recognition ; Recurrent neural networks ; recurrent neural networks (RNNs) ; Skeleton ; Sports ; Streaming media ; Three-dimensional displays ; Two dimensional displays ; Video data |
title | 2-D Skeleton-Based Action Recognition via Two-Branch Stacked LSTM-RNNs |
url | https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-10T14%3A55%3A19IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_RIE&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=2-D%20Skeleton-Based%20Action%20Recognition%20via%20Two-Branch%20Stacked%20LSTM-RNNs&rft.jtitle=IEEE%20transactions%20on%20multimedia&rft.au=Avola,%20Danilo&rft.date=2020-10-01&rft.volume=22&rft.issue=10&rft.spage=2481&rft.epage=2496&rft.pages=2481-2496&rft.issn=1520-9210&rft.eissn=1941-0077&rft.coden=ITMUF8&rft_id=info:doi/10.1109/TMM.2019.2960588&rft_dat=%3Cproquest_RIE%3E2446058559%3C/proquest_RIE%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2446058559&rft_id=info:pmid/&rft_ieee_id=8936339&rfr_iscdi=true |