XYZ-channel encoding and augmentation of human joint skeleton coordinates for end-to-end action recognition

Recognizing human actions from skeletal data is a major challenge, as it does not always deliver optimal performance due to the limited ability to discern the spatio-temporal patterns inherent in skeletal data. This study aims to enhance the precision of action recognition by conceptualizing each ac...

Bibliographic details
Published in: Signal, image and video processing, 2024-11, Vol.18 (11), p.7857-7871
Main authors: Elaoud, Amani, Ghazouani, Haythem, Barhoumi, Walid
Format: Article
Language: eng
Subjects:
Online access: Full text
container_end_page 7871
container_issue 11
container_start_page 7857
container_title Signal, image and video processing
container_volume 18
creator Elaoud, Amani
Ghazouani, Haythem
Barhoumi, Walid
description Recognizing human actions from skeletal data is a major challenge, as it does not always deliver optimal performance due to the limited ability to discern the spatio-temporal patterns inherent in skeletal data. This study aims to enhance the precision of action recognition by conceptualizing each action as a 3D matrix, accurately capturing spatio-temporal dynamics within images. These matrices offer a comprehensive encapsulation of the dynamic evolution of skeletal joint coordinates (x, y, and z) over time, affording a holistic comprehension of human actions. Using these 3D matrices as three-channel images enables us to capture the rich spatio-temporal information they contain. The suggested XYZ-channel action encoding facilitates the application of data augmentation techniques, thereby enhancing model generalization and robustness. Furthermore, we present a customized CNN architecture designed to efficiently extract spatio-temporal features from actions coded on the XYZ channel and classify them accurately. Extensive experiments on diverse datasets, including MSR Action3D, UTD-MAD and CZU-MHAD, demonstrate the effectiveness of the proposed CNN architecture. We achieve a test set accuracy of 96% on the MSR Action3D dataset, 97.9% on the UTD-MAD dataset and 98% on the CZU-MHAD dataset, underlining the method's ability to accurately recognize human actions from skeletal data in challenging scenarios.
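The encoding described in the abstract can be illustrated with a minimal sketch. The abstract does not give the image layout, the normalization, or the specific augmentations, so everything below is an assumption made for illustration: joints are placed on rows and frames on columns, each coordinate channel is min-max scaled to 0..255, and a random temporal crop stands in for the augmentation step. The function names `encode_xyz_image` and `random_temporal_crop` are hypothetical and do not come from the article.

```python
import numpy as np


def encode_xyz_image(skeleton: np.ndarray) -> np.ndarray:
    """Turn a (frames, joints, 3) coordinate array into a (joints, frames, 3) uint8 image.

    Assumed layout (not taken from the article): rows are joints, columns are
    frames, and the three channels hold the x, y and z coordinates.
    """
    if skeleton.ndim != 3 or skeleton.shape[2] != 3:
        raise ValueError("expected an array of shape (frames, joints, 3)")
    # Joints on rows, frames on columns.
    img = np.transpose(skeleton, (1, 0, 2)).astype(np.float64)
    # Min-max normalise each coordinate channel independently (assumed scheme).
    lo = img.min(axis=(0, 1), keepdims=True)
    hi = img.max(axis=(0, 1), keepdims=True)
    span = np.where(hi > lo, hi - lo, 1.0)  # constant channels map to 0
    return np.round(255.0 * (img - lo) / span).astype(np.uint8)


def random_temporal_crop(image: np.ndarray, crop_frames: int,
                         rng: np.random.Generator) -> np.ndarray:
    """Illustrative augmentation: keep a random window of consecutive frames."""
    total_frames = image.shape[1]
    if not 1 <= crop_frames <= total_frames:
        raise ValueError("crop_frames must be between 1 and the number of frames")
    start = int(rng.integers(0, total_frames - crop_frames + 1))
    return image[:, start:start + crop_frames, :]
```

Under these assumptions, a sequence of 40 frames with 20 joints becomes a 20 x 40 three-channel image that an ordinary image CNN can consume, and image-style augmentations can then be applied to it directly.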
doi_str_mv 10.1007/s11760-024-03434-4
format Article
addtitle SIViP
publisher London: Springer London
general Springer London
Springer Nature B.V
date 2024-11-01
tpages 15
lds50 peer_reviewed
collection CrossRef
pqid 3104476141
rights The Author(s), under exclusive licence to Springer-Verlag London Ltd., part of Springer Nature 2024. Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
linktopdf https://link.springer.com/content/pdf/10.1007/s11760-024-03434-4
linktohtml https://link.springer.com/10.1007/s11760-024-03434-4
fulltext fulltext
identifier ISSN: 1863-1703
ispartof Signal, image and video processing, 2024-11, Vol.18 (11), p.7857-7871
issn 1863-1703
1863-1711
language eng
recordid cdi_proquest_journals_3104476141
source SpringerLink Journals - AutoHoldings
subjects Activity recognition
Coding
Computer Imaging
Computer Science
Data augmentation
Datasets
Human performance
Image enhancement
Image Processing and Computer Vision
Multimedia Information Systems
Original Paper
Pattern recognition
Pattern Recognition and Graphics
Signal,Image and Speech Processing
Spatiotemporal data
Vision
title XYZ-channel encoding and augmentation of human joint skeleton coordinates for end-to-end action recognition
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2024-12-24T18%3A17%3A11IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_cross&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=XYZ-channel%20encoding%20and%20augmentation%20of%20human%20joint%20skeleton%20coordinates%20for%20end-to-end%20action%20recognition&rft.jtitle=Signal,%20image%20and%20video%20processing&rft.au=Elaoud,%20Amani&rft.date=2024-11-01&rft.volume=18&rft.issue=11&rft.spage=7857&rft.epage=7871&rft.pages=7857-7871&rft.issn=1863-1703&rft.eissn=1863-1711&rft_id=info:doi/10.1007/s11760-024-03434-4&rft_dat=%3Cproquest_cross%3E3104476141%3C/proquest_cross%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=3104476141&rft_id=info:pmid/&rfr_iscdi=true