Speech recognition using visual cues with a two stage detector network for visemes classification and sentence detection
Saved in:

Main Authors: ,
Format: Conference Proceedings
Language: English
Subjects:
Online Access: Full text
Abstract: Automated lip reading has recently gained increasing research attention, and many advances are being made in this area using various deep learning algorithms. Automated lip reading can be performed with or without audio; when lip movements are identified without the sound of speech, the task is often referred to as visual speech recognition. One of the drawbacks in visual speech recognition is detecting words that have similar lip movements. Visemes are speech sounds that look the same for different words with similar lip movements. In this paper, an Inception encoder-decoder network is proposed for visual lip reading. The model is developed to detect lip-read sentences drawn from a varied vocabulary, as well as sentences not included in model training. In the proposed method, visemes are classified and used as the basis for detecting lip-read sentences; the detected visemes are then converted to sentences using perplexity analysis. The proposed method is lexicon-free and based entirely on visual cues in the form of visemes. The model has been evaluated on the Lip Reading Sentences 3 (LRS-3) TED benchmark dataset, which contains challenging videos from TED and TEDx talks. In addition to the LRS-3 dataset, further experiments have been performed with videos of varying illumination. The proposed model achieves state-of-the-art results compared with current lip-reading models: experimental results show a 13% lower error rate and greater robustness to varying illumination.
ISSN: 0094-243X, 1551-7616
DOI: 10.1063/5.0217205
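
The abstract describes converting a detected viseme sequence into a sentence via perplexity analysis, which addresses the ambiguity of words that share the same lip movements. The sketch below is a minimal illustration of that idea, not the authors' implementation: the viseme-to-word lookup table and the toy bigram language model are hypothetical stand-ins, and the code simply selects the candidate word sequence with the lowest perplexity.

```python
# Illustrative sketch (not the authors' code): rank candidate word sequences
# that are consistent with a detected viseme sequence by language-model
# perplexity and keep the lowest-perplexity sentence.
import math
from itertools import product
from collections import defaultdict

# Hypothetical mapping: viseme label -> words that share that lip shape.
VISEME_TO_WORDS = {
    "P": ["pat", "bat", "mat"],   # bilabials /p/, /b/, /m/ look alike
    "F": ["fan", "van"],          # labiodentals /f/, /v/ look alike
    "AH": ["a"],
}

# Toy bigram log-probabilities standing in for a real language model.
BIGRAM_LOGP = defaultdict(lambda: math.log(1e-4))
BIGRAM_LOGP.update({
    ("<s>", "a"): math.log(0.5),
    ("a", "fan"): math.log(0.3),
    ("a", "van"): math.log(0.05),
    ("fan", "</s>"): math.log(0.4),
    ("van", "</s>"): math.log(0.4),
})

def perplexity(words):
    """Perplexity of a word sequence under the toy bigram model."""
    tokens = ["<s>"] + list(words) + ["</s>"]
    logp = sum(BIGRAM_LOGP[(w1, w2)] for w1, w2 in zip(tokens, tokens[1:]))
    return math.exp(-logp / (len(tokens) - 1))

def visemes_to_sentence(viseme_sequence):
    """Among all word sequences consistent with the detected visemes,
    return the one with the lowest perplexity."""
    candidates = product(*(VISEME_TO_WORDS[v] for v in viseme_sequence))
    return min(candidates, key=perplexity)

if __name__ == "__main__":
    detected = ["AH", "F"]                           # e.g. classifier output
    print(" ".join(visemes_to_sentence(detected)))   # -> "a fan"
```

In practice, the viseme sequence would come from the trained classifier and the scoring from a full language model rather than the toy tables above; the sketch only shows how perplexity can disambiguate words that are visually identical on the lips.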