Efficient sign language recognition system and dataset creation method based on deep learning and image processing

New deep-learning architectures are created every year, achieving state-of-the-art results in image recognition and leading to the belief that, in a few years, complex tasks such as sign language translation will be considerably easier, serving as a communication tool for the hearing-impaired community.

Full description

Saved in:
Bibliographic details
Main authors: Carneiro, Alvaro Leandro Cavalcante, Silva, Lucas de Brito, Salvadeo, Denis Henrique Pinheiro
Format: Article
Language: eng
Subjects:
Online access: Order full text
creator Carneiro, Alvaro Leandro Cavalcante
Silva, Lucas de Brito
Salvadeo, Denis Henrique Pinheiro
description New deep-learning architectures are created every year, achieving state-of-the-art results in image recognition and leading to the belief that, in a few years, complex tasks such as sign language translation will be considerably easier, serving as a communication tool for the hearing-impaired community. However, these algorithms still require large amounts of training data, and dataset creation is expensive, time-consuming, and slow. Therefore, this work investigates digital image processing and machine learning techniques that can be used to create a sign language dataset effectively. We examine data acquisition choices, such as the frame rate at which to capture or subsample the videos, the background type, preprocessing, and data augmentation, using convolutional neural networks and object detection to build an image classifier and comparing the results with statistical tests. Different datasets were created to test the hypotheses, containing 14 everyday words recorded by different smartphones in the RGB color system. We achieved an accuracy of 96.38% on the test set and 81.36% on a validation set with more challenging conditions, showing that 30 FPS is the best frame rate for subsampling the training videos, that geometric transformations work better than intensity transformations, and that artificial background creation does not improve model generalization. These trade-offs should be considered in future work as a cost-benefit guideline between computational cost and accuracy gain when creating a dataset and training a sign recognition model.
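The description above mentions two concrete, reproducible steps: subsampling the recorded videos to a target frame rate (30 FPS performed best) and applying geometric augmentations such as small rotations. The Python sketch below illustrates one plausible way to carry out these steps with OpenCV; it is not the authors' published code, and the file name, function names, and parameter values are assumptions for illustration.

# Minimal illustrative sketch (not the authors' released code) of two steps the
# description mentions: subsampling video frames to a target rate (30 FPS worked
# best in the paper) and applying a geometric augmentation. File and helper
# names below are hypothetical.
import random
import cv2

def subsample_frames(video_path, target_fps=30):
    """Read a video and keep frames at roughly `target_fps`, converted to RGB."""
    cap = cv2.VideoCapture(video_path)
    source_fps = cap.get(cv2.CAP_PROP_FPS) or target_fps
    step = max(int(round(source_fps / target_fps)), 1)  # keep every `step`-th frame
    frames, index = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % step == 0:
            frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))  # RGB, as in the paper
        index += 1
    cap.release()
    return frames

def random_rotation(frame, max_degrees=10):
    """Geometric augmentation example: small random rotation about the image centre."""
    h, w = frame.shape[:2]
    angle = random.uniform(-max_degrees, max_degrees)
    matrix = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
    return cv2.warpAffine(frame, matrix, (w, h))

# Hypothetical usage: build an augmented training set from one recorded sign video.
frames = subsample_frames("sign_word_01.mp4", target_fps=30)
augmented = [random_rotation(f) for f in frames]

The rotation function stands in for the geometric transformations the paper found more effective than intensity transformations (e.g. brightness or contrast changes), which could be swapped in analogously.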
format Article
creationdate 2021-03-22
rights http://creativecommons.org/licenses/by/4.0
identifier DOI: 10.48550/arxiv.2103.12233
language eng
source arXiv.org
subjects Computer Science - Computer Vision and Pattern Recognition
Computer Science - Human-Computer Interaction
Computer Science - Learning
title Efficient sign language recognition system and dataset creation method based on deep learning and image processing