VISUAL SPEECH RECOGNITION BY PHONEME PREDICTION
Main authors: | , , |
---|---|
Format: | Patent |
Language: | eng |
Abstract: | Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for performing visual speech recognition. In one aspect, a method comprises receiving a video comprising a plurality of video frames, wherein each video frame depicts a pair of lips; processing the video using a visual speech recognition neural network to generate, for each output position in an output sequence, a respective output score for each token in a vocabulary of possible tokens, wherein the visual speech recognition neural network comprises one or more volumetric convolutional neural network layers and one or more time-aggregation neural network layers; wherein the vocabulary of possible tokens comprises a plurality of phonemes; and determining a sequence of words expressed by the pair of lips depicted in the video using the output scores. |
---|
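
The pipeline summarized in the abstract (volumetric convolution over the video frames, aggregation over time, per-position scores over a phoneme vocabulary, then a mapping from scores to words) can be illustrated with a small pure-Python sketch. Nothing below is taken from the patent itself: the toy vocabulary `PHONEMES`, the `LEXICON` table, the exponential running average used as the time-aggregation step, the blank token, and the greedy collapse decoding are all assumptions chosen to keep the example short. A real system would presumably use learned layers in a deep learning framework and a more capable decoder.

```python
import math
import random

# Hypothetical toy vocabulary; index 0 is a blank token (an assumption of this sketch).
PHONEMES = ["<blank>", "HH", "AH", "L", "OW"]
# Hypothetical pronunciation lexicon mapping phoneme sequences to words.
LEXICON = {("HH", "AH", "L", "OW"): "hello"}


def conv3d_relu(video, kernel):
    """Volumetric (3D) convolution over time, height and width, no padding, then ReLU."""
    kt, kh, kw = len(kernel), len(kernel[0]), len(kernel[0][0])
    out = []
    for t in range(len(video) - kt + 1):
        plane = []
        for y in range(len(video[0]) - kh + 1):
            row = []
            for x in range(len(video[0][0]) - kw + 1):
                s = sum(video[t + dt][y + dy][x + dx] * kernel[dt][dy][dx]
                        for dt in range(kt) for dy in range(kh) for dx in range(kw))
                row.append(max(s, 0.0))
            plane.append(row)
        out.append(plane)
    return out


def spatial_mean(volume):
    """Average each time step's feature map down to a single number."""
    return [sum(sum(row) for row in plane) / (len(plane) * len(plane[0]))
            for plane in volume]


def aggregate_time(features, decay=0.5):
    """Stand-in time aggregation: an exponential running average over time steps."""
    state = [0.0] * len(features[0])
    out = []
    for step in features:
        state = [decay * s + (1.0 - decay) * v for s, v in zip(state, step)]
        out.append(state)
    return out


def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def phoneme_scores(video, kernels, weights):
    """Return, for each output position, one score per token in PHONEMES."""
    per_kernel = [spatial_mean(conv3d_relu(video, k)) for k in kernels]
    features = [list(step) for step in zip(*per_kernel)]  # indexed [time][kernel]
    hidden = aggregate_time(features)
    return [softmax([sum(w * h for w, h in zip(row, step)) for row in weights])
            for step in hidden]


def greedy_phonemes(scores):
    """Pick the best token per position, merge repeats, drop blanks (an assumed decoder)."""
    best = [max(range(len(s)), key=s.__getitem__) for s in scores]
    collapsed = [p for i, p in enumerate(best) if i == 0 or p != best[i - 1]]
    return [PHONEMES[p] for p in collapsed if p != 0]


def phonemes_to_words(phonemes):
    """Look the whole phoneme sequence up in the toy lexicon."""
    word = LEXICON.get(tuple(phonemes))
    return [word] if word else []


# Random, untrained parameters: 6 frames of 8x8 "lip crops", four 3x3x3 kernels.
random.seed(0)
video = [[[random.random() for _ in range(8)] for _ in range(8)] for _ in range(6)]
kernels = [[[[random.uniform(-1, 1) for _ in range(3)] for _ in range(3)]
            for _ in range(3)] for _ in range(4)]
weights = [[random.uniform(-1, 1) for _ in range(4)] for _ in PHONEMES]
scores = phoneme_scores(video, kernels, weights)
```

With untrained random weights the scores carry no meaning; the sketch only shows how the shapes flow from a video volume to one score vector per output position. Passing hand-built score vectors to `greedy_phonemes` and then `phonemes_to_words` shows the final step of turning phoneme scores into words.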