Sign Language Recognition via Skeleton-Aware Multi-Model Ensemble
creator | Jiang, Songyao; Sun, Bin; Wang, Lichen; Bai, Yue; Li, Kunpeng; Fu, Yun |
description | Sign language is commonly used by deaf or mute people to communicate, but it
requires extensive effort to master. It is performed with fast yet delicate
movements of the hands, body posture, and even facial expressions. Current
Sign Language Recognition (SLR) methods usually extract features via deep
neural networks and suffer from overfitting due to limited and noisy data.
Recently, skeleton-based action recognition has attracted increasing attention
because of its subject-invariant and background-invariant nature, whereas
skeleton-based SLR remains under-explored due to the lack of hand annotations.
Some researchers have tried to use offline hand pose trackers to obtain hand
keypoints and aid sign language recognition via recurrent neural networks, but
none of these approaches outperforms RGB-based ones yet. To this end, we
propose a Skeleton Aware Multi-modal Framework with a Global Ensemble Model
(GEM) for isolated SLR (SAM-SLR-v2) that learns and fuses multi-modal feature
representations to achieve a higher recognition rate. Specifically, we propose
a Sign Language Graph Convolution Network (SL-GCN) to model the embedded
dynamics of skeleton keypoints and a Separable Spatial-Temporal Convolution
Network (SSTCN) to exploit skeleton features. The skeleton-based predictions
are fused with other RGB- and depth-based modalities by the proposed
late-fusion GEM to provide global information and produce a faithful SLR
prediction. Experiments on three isolated SLR datasets demonstrate that the
proposed SAM-SLR-v2 framework is highly effective and achieves
state-of-the-art performance by significant margins. Our code will be
available at https://github.com/jackyjsy/SAM-SLR-v2 |
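
To make the two core ideas in the description concrete, the sketch below illustrates (1) a single graph-convolution step over skeleton keypoints, in the spirit of SL-GCN, and (2) GEM-style late fusion of per-modality class scores. It is a minimal NumPy sketch, not the authors' implementation: the random skeleton adjacency, keypoint count, class count, and fusion weights are illustrative placeholders.

```python
import numpy as np

# --- Minimal sketch of one spatial graph-convolution step over skeleton keypoints ---
# Assumptions: V keypoints connected by an adjacency matrix A, per-frame keypoint
# features X of shape (V, C_in), and a learnable projection W of shape (C_in, C_out).

def normalize_adjacency(A):
    """Symmetrically normalize A with self-loops: D^-1/2 (A + I) D^-1/2."""
    A_hat = A + np.eye(A.shape[0])
    d = A_hat.sum(axis=1)                      # node degrees (>= 1 because of self-loops)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    return D_inv_sqrt @ A_hat @ D_inv_sqrt

def graph_conv(X, A_norm, W):
    """Aggregate neighbor features along the skeleton graph, then project and apply ReLU."""
    return np.maximum(A_norm @ X @ W, 0.0)

# --- Minimal sketch of late fusion across modalities (weighted sum of class scores) ---
# Assumptions: each modality branch outputs a probability vector over N sign classes;
# the fusion weights below are hand-picked placeholders, not the paper's values.

def late_fusion(prob_by_modality, weights):
    fused = np.zeros_like(next(iter(prob_by_modality.values())))
    for name, probs in prob_by_modality.items():
        fused += weights[name] * probs
    return int(np.argmax(fused)), fused

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    V, C_in, C_out, N = 27, 3, 16, 226         # illustrative sizes only
    A = (rng.random((V, V)) < 0.1).astype(float)
    A = np.maximum(A, A.T)                      # undirected skeleton graph
    X = rng.standard_normal((V, C_in))          # keypoint coordinates for one frame
    W = rng.standard_normal((C_in, C_out)) * 0.1
    feats = graph_conv(X, normalize_adjacency(A), W)
    print("graph-conv output shape:", feats.shape)

    def softmax(z):
        e = np.exp(z - z.max())
        return e / e.sum()

    # Stand-in class scores for three modality branches (skeleton, RGB, depth).
    probs = {m: softmax(rng.standard_normal(N)) for m in ["skeleton", "rgb", "depth"]}
    pred, fused = late_fusion(probs, {"skeleton": 1.0, "rgb": 0.9, "depth": 0.4})
    print("fused prediction:", pred)
```

In the actual framework, each branch is a trained network rather than random scores, and the per-modality fusion weights would be learned or tuned rather than fixed by hand; the sketch only shows the data flow of a graph-convolution step and of the late-fusion ensemble.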
doi_str_mv | 10.48550/arxiv.2110.06161 |
format | Article |
fullrecord | <record><control><sourceid>arxiv_GOX</sourceid><recordid>TN_cdi_arxiv_primary_2110_06161</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><sourcerecordid>2110_06161</sourcerecordid><originalsourceid>FETCH-LOGICAL-a671-6992f67b975a692f32605d84c5653124d949d324560c520a6ec63109e8695dec3</originalsourceid><addsrcrecordid>eNotz8tOwzAUBFBvWKDCB7DCP-Di5028jKrykFIh0e6jW_s2snAdlKYF_p5SWM1oFiMdxu6UnNvaOfmA41c6zbU6DxIUqGvWrFNfeIulP2JP_I3C0Jc0paHwU0K-fqdM01BE84kj8dUxT0mshkiZL8uB9ttMN-xqh_lAt_85Y5vH5WbxLNrXp5dF0wqESgnwXu-g2vrKIZyr0SBdrG1w4IzSNnrro9HWgQxOSwQKYJT0VIN3kYKZsfu_24uh-xjTHsfv7tfSXSzmB6ofQjc</addsrcrecordid><sourcetype>Open Access Repository</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype></control><display><type>article</type><title>Sign Language Recognition via Skeleton-Aware Multi-Model Ensemble</title><source>arXiv.org</source><creator>Jiang, Songyao ; Sun, Bin ; Wang, Lichen ; Bai, Yue ; Li, Kunpeng ; Fu, Yun</creator><creatorcontrib>Jiang, Songyao ; Sun, Bin ; Wang, Lichen ; Bai, Yue ; Li, Kunpeng ; Fu, Yun</creatorcontrib><description>Sign language is commonly used by deaf or mute people to communicate but
requires extensive effort to master. It is usually performed with the fast yet
delicate movement of hand gestures, body posture, and even facial expressions.
Current Sign Language Recognition (SLR) methods usually extract features via
deep neural networks and suffer overfitting due to limited and noisy data.
Recently, skeleton-based action recognition has attracted increasing attention
due to its subject-invariant and background-invariant nature, whereas
skeleton-based SLR is still under exploration due to the lack of hand
annotations. Some researchers have tried to use off-line hand pose trackers to
obtain hand keypoints and aid in recognizing sign language via recurrent neural
networks. Nevertheless, none of them outperforms RGB-based approaches yet. To
this end, we propose a novel Skeleton Aware Multi-modal Framework with a Global
Ensemble Model (GEM) for isolated SLR (SAM-SLR-v2) to learn and fuse
multi-modal feature representations towards a higher recognition rate.
Specifically, we propose a Sign Language Graph Convolution Network (SL-GCN) to
model the embedded dynamics of skeleton keypoints and a Separable
Spatial-Temporal Convolution Network (SSTCN) to exploit skeleton features. The
skeleton-based predictions are fused with other RGB and depth based modalities
by the proposed late-fusion GEM to provide global information and make a
faithful SLR prediction. Experiments on three isolated SLR datasets demonstrate
that our proposed SAM-SLR-v2 framework is exceedingly effective and achieves
state-of-the-art performance with significant margins. Our code will be
available at https://github.com/jackyjsy/SAM-SLR-v2</description><identifier>DOI: 10.48550/arxiv.2110.06161</identifier><language>eng</language><subject>Computer Science - Artificial Intelligence ; Computer Science - Computer Vision and Pattern Recognition</subject><creationdate>2021-10</creationdate><rights>http://creativecommons.org/licenses/by/4.0</rights><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><link.rule.ids>228,230,776,881</link.rule.ids><linktorsrc>$$Uhttps://arxiv.org/abs/2110.06161$$EView_record_in_Cornell_University$$FView_record_in_$$GCornell_University$$Hfree_for_read</linktorsrc><backlink>$$Uhttps://doi.org/10.48550/arXiv.2110.06161$$DView paper in arXiv$$Hfree_for_read</backlink></links><search><creatorcontrib>Jiang, Songyao</creatorcontrib><creatorcontrib>Sun, Bin</creatorcontrib><creatorcontrib>Wang, Lichen</creatorcontrib><creatorcontrib>Bai, Yue</creatorcontrib><creatorcontrib>Li, Kunpeng</creatorcontrib><creatorcontrib>Fu, Yun</creatorcontrib><title>Sign Language Recognition via Skeleton-Aware Multi-Model Ensemble</title><description>Sign language is commonly used by deaf or mute people to communicate but
requires extensive effort to master. It is usually performed with the fast yet
delicate movement of hand gestures, body posture, and even facial expressions.
Current Sign Language Recognition (SLR) methods usually extract features via
deep neural networks and suffer overfitting due to limited and noisy data.
Recently, skeleton-based action recognition has attracted increasing attention
due to its subject-invariant and background-invariant nature, whereas
skeleton-based SLR is still under exploration due to the lack of hand
annotations. Some researchers have tried to use off-line hand pose trackers to
obtain hand keypoints and aid in recognizing sign language via recurrent neural
networks. Nevertheless, none of them outperforms RGB-based approaches yet. To
this end, we propose a novel Skeleton Aware Multi-modal Framework with a Global
Ensemble Model (GEM) for isolated SLR (SAM-SLR-v2) to learn and fuse
multi-modal feature representations towards a higher recognition rate.
Specifically, we propose a Sign Language Graph Convolution Network (SL-GCN) to
model the embedded dynamics of skeleton keypoints and a Separable
Spatial-Temporal Convolution Network (SSTCN) to exploit skeleton features. The
skeleton-based predictions are fused with other RGB and depth based modalities
by the proposed late-fusion GEM to provide global information and make a
faithful SLR prediction. Experiments on three isolated SLR datasets demonstrate
that our proposed SAM-SLR-v2 framework is exceedingly effective and achieves
state-of-the-art performance with significant margins. Our code will be
available at https://github.com/jackyjsy/SAM-SLR-v2</description><subject>Computer Science - Artificial Intelligence</subject><subject>Computer Science - Computer Vision and Pattern Recognition</subject><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2021</creationdate><recordtype>article</recordtype><sourceid>GOX</sourceid><recordid>eNotz8tOwzAUBFBvWKDCB7DCP-Di5028jKrykFIh0e6jW_s2snAdlKYF_p5SWM1oFiMdxu6UnNvaOfmA41c6zbU6DxIUqGvWrFNfeIulP2JP_I3C0Jc0paHwU0K-fqdM01BE84kj8dUxT0mshkiZL8uB9ttMN-xqh_lAt_85Y5vH5WbxLNrXp5dF0wqESgnwXu-g2vrKIZyr0SBdrG1w4IzSNnrro9HWgQxOSwQKYJT0VIN3kYKZsfu_24uh-xjTHsfv7tfSXSzmB6ofQjc</recordid><startdate>20211012</startdate><enddate>20211012</enddate><creator>Jiang, Songyao</creator><creator>Sun, Bin</creator><creator>Wang, Lichen</creator><creator>Bai, Yue</creator><creator>Li, Kunpeng</creator><creator>Fu, Yun</creator><scope>AKY</scope><scope>GOX</scope></search><sort><creationdate>20211012</creationdate><title>Sign Language Recognition via Skeleton-Aware Multi-Model Ensemble</title><author>Jiang, Songyao ; Sun, Bin ; Wang, Lichen ; Bai, Yue ; Li, Kunpeng ; Fu, Yun</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-a671-6992f67b975a692f32605d84c5653124d949d324560c520a6ec63109e8695dec3</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2021</creationdate><topic>Computer Science - Artificial Intelligence</topic><topic>Computer Science - Computer Vision and Pattern Recognition</topic><toplevel>online_resources</toplevel><creatorcontrib>Jiang, Songyao</creatorcontrib><creatorcontrib>Sun, Bin</creatorcontrib><creatorcontrib>Wang, Lichen</creatorcontrib><creatorcontrib>Bai, Yue</creatorcontrib><creatorcontrib>Li, Kunpeng</creatorcontrib><creatorcontrib>Fu, Yun</creatorcontrib><collection>arXiv Computer Science</collection><collection>arXiv.org</collection></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext_linktorsrc</fulltext></delivery><addata><au>Jiang, Songyao</au><au>Sun, Bin</au><au>Wang, Lichen</au><au>Bai, Yue</au><au>Li, Kunpeng</au><au>Fu, Yun</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>Sign Language Recognition via Skeleton-Aware Multi-Model Ensemble</atitle><date>2021-10-12</date><risdate>2021</risdate><abstract>Sign language is commonly used by deaf or mute people to communicate but
requires extensive effort to master. It is usually performed with the fast yet
delicate movement of hand gestures, body posture, and even facial expressions.
Current Sign Language Recognition (SLR) methods usually extract features via
deep neural networks and suffer overfitting due to limited and noisy data.
Recently, skeleton-based action recognition has attracted increasing attention
due to its subject-invariant and background-invariant nature, whereas
skeleton-based SLR is still under exploration due to the lack of hand
annotations. Some researchers have tried to use off-line hand pose trackers to
obtain hand keypoints and aid in recognizing sign language via recurrent neural
networks. Nevertheless, none of them outperforms RGB-based approaches yet. To
this end, we propose a novel Skeleton Aware Multi-modal Framework with a Global
Ensemble Model (GEM) for isolated SLR (SAM-SLR-v2) to learn and fuse
multi-modal feature representations towards a higher recognition rate.
Specifically, we propose a Sign Language Graph Convolution Network (SL-GCN) to
model the embedded dynamics of skeleton keypoints and a Separable
Spatial-Temporal Convolution Network (SSTCN) to exploit skeleton features. The
skeleton-based predictions are fused with other RGB and depth based modalities
by the proposed late-fusion GEM to provide global information and make a
faithful SLR prediction. Experiments on three isolated SLR datasets demonstrate
that our proposed SAM-SLR-v2 framework is exceedingly effective and achieves
state-of-the-art performance with significant margins. Our code will be
available at https://github.com/jackyjsy/SAM-SLR-v2</abstract><doi>10.48550/arxiv.2110.06161</doi><oa>free_for_read</oa></addata></record> |
identifier | DOI: 10.48550/arxiv.2110.06161 |
language | eng |
recordid | cdi_arxiv_primary_2110_06161 |
source | arXiv.org |
subjects | Computer Science - Artificial Intelligence; Computer Science - Computer Vision and Pattern Recognition |
title | Sign Language Recognition via Skeleton-Aware Multi-Model Ensemble |
url | https://arxiv.org/abs/2110.06161 |