XYZ-channel encoding and augmentation of human joint skeleton coordinates for end-to-end action recognition
Recognizing human actions from skeletal data is a major challenge, as it does not always deliver optimal performance due to the limited ability to discern the spatio-temporal patterns inherent in skeletal data. This study aims to enhance the precision of action recognition by conceptualizing each ac...
Published in: | Signal, image and video processing, 2024-11, Vol.18 (11), p.7857-7871 |
---|---|
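Main authors: | Elaoud, Amani; Ghazouani, Haythem; Barhoumi, Walid |
Format: | Article |
Language: | eng |
Subjects: | Activity recognition; Coding; Computer Imaging; Computer Science; Data augmentation; Datasets; Human performance; Image enhancement; Image Processing and Computer Vision; Multimedia Information Systems; Original Paper; Pattern recognition; Pattern Recognition and Graphics; Signal, Image and Speech Processing; Spatiotemporal data; Vision |
Online access: | Full text |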
container_end_page | 7871 |
---|---|
container_issue | 11 |
container_start_page | 7857 |
container_title | Signal, image and video processing |
container_volume | 18 |
creator | Elaoud, Amani; Ghazouani, Haythem; Barhoumi, Walid |
description | Recognizing human actions from skeletal data is a major challenge, as it does not always deliver optimal performance due to the limited ability to discern the spatio-temporal patterns inherent in skeletal data. This study aims to enhance the precision of action recognition by conceptualizing each action as a 3D matrix, accurately capturing spatio-temporal dynamics within images. These matrices offer a comprehensive encapsulation of the dynamic evolution of skeletal joint coordinates (x, y, and z) over time, affording a holistic comprehension of human actions. Using these 3D matrices as three-channel images enables us to capture the rich spatio-temporal information they contain. The suggested XYZ-channel action encoding facilitates the application of data augmentation techniques, thereby enhancing model generalization and robustness. Furthermore, we present a customized CNN architecture designed to efficiently extract spatio-temporal features from actions coded on the XYZ channel and classify them accurately. Extensive experiments on diverse datasets, including MSR Action3D, UTD-MAD and CZU-MHAD, demonstrate the effectiveness of the proposed CNN architecture. We achieve a test set accuracy of 96% on the MSR Action3D dataset, 97.9% on the UTD-MAD dataset and 98% on the CZU-MHAD dataset, underlining the method’s ability to accurately recognize human actions from skeletal data in challenging scenarios. |
doi_str_mv | 10.1007/s11760-024-03434-4 |
format | Article |
fullrecord | Raw XML record (source ID proquest_cross, record ID TN_cdi_proquest_journals_3104476141) repeating the title, creators, description, subjects, identifiers and pagination listed in this table. It additionally records: publisher: London: Springer London; EISSN: 1863-1711; journal abbreviation: SIViP; publication date: 2024-11-01; total pages: 15; peer reviewed; full text: https://link.springer.com/10.1007/s11760-024-03434-4; rights: The Author(s), under exclusive licence to Springer-Verlag London Ltd., part of Springer Nature 2024. Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law. |
fulltext | fulltext |
identifier | ISSN: 1863-1703 |
ispartof | Signal, image and video processing, 2024-11, Vol.18 (11), p.7857-7871 |
issn | 1863-1703 1863-1711 |
language | eng |
recordid | cdi_proquest_journals_3104476141 |
source | SpringerLink Journals - AutoHoldings |
subjects | Activity recognition; Coding; Computer Imaging; Computer Science; Data augmentation; Datasets; Human performance; Image enhancement; Image Processing and Computer Vision; Multimedia Information Systems; Original Paper; Pattern recognition; Pattern Recognition and Graphics; Signal, Image and Speech Processing; Spatiotemporal data; Vision |
title | XYZ-channel encoding and augmentation of human joint skeleton coordinates for end-to-end action recognition |
url | https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2024-12-24T18%3A17%3A11IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_cross&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=XYZ-channel%20encoding%20and%20augmentation%20of%20human%20joint%20skeleton%20coordinates%20for%20end-to-end%20action%20recognition&rft.jtitle=Signal,%20image%20and%20video%20processing&rft.au=Elaoud,%20Amani&rft.date=2024-11-01&rft.volume=18&rft.issue=11&rft.spage=7857&rft.epage=7871&rft.pages=7857-7871&rft.issn=1863-1703&rft.eissn=1863-1711&rft_id=info:doi/10.1007/s11760-024-03434-4&rft_dat=%3Cproquest_cross%3E3104476141%3C/proquest_cross%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=3104476141&rft_id=info:pmid/&rfr_iscdi=true |
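
The description in this record says that each action is treated as a 3D matrix of joint coordinates over time and used as a three-channel image, which in turn makes image-style data augmentation possible. The record does not give the exact encoding, so the following is only a minimal sketch of the general idea under stated assumptions: rows are frames, columns are joints, the three channels hold x, y and z, and each channel is min-max scaled to 0..255. The function names `encode_xyz_image` and `jitter`, the scaling rule, and the Gaussian-noise augmentation are all hypothetical illustrations, not the authors' method.

```python
import random


def encode_xyz_image(sequence):
    """Encode a skeleton sequence as a three-channel image (illustrative only).

    sequence: list of T frames; each frame is a list of J joints;
              each joint is an (x, y, z) tuple of floats.
    Returns a T x J x 3 nested list of ints in [0, 255].
    """
    # Assumption: per-axis min-max normalization over the whole sequence.
    lo = [min(joint[c] for frame in sequence for joint in frame) for c in range(3)]
    hi = [max(joint[c] for frame in sequence for joint in frame) for c in range(3)]

    def scale(value, c):
        span = hi[c] - lo[c]
        # A constant axis carries no information; map it to 0.
        return 0 if span == 0 else round(255 * (value - lo[c]) / span)

    return [[[scale(joint[c], c) for c in range(3)] for joint in frame]
            for frame in sequence]


def jitter(sequence, sigma=0.01, seed=0):
    """Hypothetical augmentation: add small Gaussian noise to every coordinate."""
    rng = random.Random(seed)
    return [[tuple(v + rng.gauss(0.0, sigma) for v in joint) for joint in frame]
            for frame in sequence]


# Toy input: 2 frames, 2 joints (values are made up for the example).
demo = [
    [(0.0, 0.0, 0.0), (1.0, 2.0, 4.0)],
    [(0.2, 0.4, 0.8), (1.0, 2.0, 4.0)],
]
image = encode_xyz_image(demo)
augmented_image = encode_xyz_image(jitter(demo, sigma=0.01, seed=1))
```

A CNN classifier would then take `image` (or a batch of such arrays, resized to a common shape) as input in the same way it takes an RGB picture; the augmented copy is one extra training sample for the same action label.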