VISUAL SPEECH RECOGNITION BY PHONEME PREDICTION

Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for performing visual speech recognition. In one aspect, a method comprises receiving a video comprising a plurality of video frames, wherein each video frame depicts a pair of lips; processing the vid...

Ausführliche Beschreibung

Gespeichert in:

Bibliographische Detailangaben
Hauptverfasser:	Shillingford, Brendan, Gomes de Freitas, Joao Ferdinando, Assael, Ioannis Alexandros
Format:	Patent
Sprache:	eng
Schlagworte:	ACOUSTICS CALCULATING COMPUTER SYSTEMS BASED ON SPECIFIC COMPUTATIONAL MODELS COMPUTING COUNTING HANDLING RECORD CARRIERS IMAGE DATA PROCESSING OR GENERATION, IN GENERAL MUSICAL INSTRUMENTS PHYSICS PRESENTATION OF DATA RECOGNITION OF DATA RECORD CARRIERS SPEECH ANALYSIS OR SYNTHESIS SPEECH OR AUDIO CODING OR DECODING SPEECH OR VOICE PROCESSING SPEECH RECOGNITION
Online-Zugang:	Volltext bestellen
Tags:	Tag hinzufügen Keine Tags, Fügen Sie den ersten Tag hinzu!

container_end_page
container_issue
container_start_page
container_title
container_volume
creator	Shillingford, Brendan Gomes de Freitas, Joao Ferdinando Assael, Ioannis Alexandros
description	Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for performing visual speech recognition. In one aspect, a method comprises receiving a video comprising a plurality of video frames, wherein each video frame depicts a pair of lips; processing the video using a visual speech recognition neural network to generate, for each output position in an output sequence, a respective output score for each token in a vocabulary of possible tokens, wherein the visual speech recognition neural network comprises one or more volumetric convolutional neural network layers and one or more time-aggregation neural network layers; wherein the vocabulary of possible tokens comprises a plurality of phonemes; and determining a sequence of words expressed by the pair of lips depicted in the video using the output scores.
format	Patent
fullrecord	<record><control><sourceid>epo_EVB</sourceid><recordid>TN_cdi_epo_espacenet_US2021110831A1</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><sourcerecordid>US2021110831A1</sourcerecordid><originalsourceid>FETCH-epo_espacenet_US2021110831A13</originalsourceid><addsrcrecordid>eNrjZNAP8wwOdfRRCA5wdXX2UAhydfZ39_MM8fT3U3CKVAjw8Pdz9XVVCAhydfF0BonyMLCmJeYUp_JCaW4GZTfXEGcP3dSC_PjU4oLE5NS81JL40GAjAyNDQ0MDC2NDR0Nj4lQBAKrgJvU</addsrcrecordid><sourcetype>Open Access Repository</sourcetype><iscdi>true</iscdi><recordtype>patent</recordtype></control><display><type>patent</type><title>VISUAL SPEECH RECOGNITION BY PHONEME PREDICTION</title><source>esp@cenet</source><creator>Shillingford, Brendan ; Gomes de Freitas, Joao Ferdinando ; Assael, Ioannis Alexandros</creator><creatorcontrib>Shillingford, Brendan ; Gomes de Freitas, Joao Ferdinando ; Assael, Ioannis Alexandros</creatorcontrib><description>Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for performing visual speech recognition. In one aspect, a method comprises receiving a video comprising a plurality of video frames, wherein each video frame depicts a pair of lips; processing the video using a visual speech recognition neural network to generate, for each output position in an output sequence, a respective output score for each token in a vocabulary of possible tokens, wherein the visual speech recognition neural network comprises one or more volumetric convolutional neural network layers and one or more time-aggregation neural network layers; wherein the vocabulary of possible tokens comprises a plurality of phonemes; and determining a sequence of words expressed by the pair of lips depicted in the video using the output scores.</description><language>eng</language><subject>ACOUSTICS ; CALCULATING ; COMPUTER SYSTEMS BASED ON SPECIFIC COMPUTATIONAL MODELS ; COMPUTING ; COUNTING ; HANDLING RECORD CARRIERS ; IMAGE DATA PROCESSING OR GENERATION, IN GENERAL ; MUSICAL INSTRUMENTS ; PHYSICS ; PRESENTATION OF DATA ; RECOGNITION OF DATA ; RECORD CARRIERS ; SPEECH ANALYSIS OR SYNTHESIS ; SPEECH OR AUDIO CODING OR DECODING ; SPEECH OR VOICE PROCESSING ; SPEECH RECOGNITION</subject><creationdate>2021</creationdate><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><linktohtml>$$Uhttps://worldwide.espacenet.com/publicationDetails/biblio?FT=D&date=20210415&DB=EPODOC&CC=US&NR=2021110831A1$$EHTML$$P50$$Gepo$$Hfree_for_read</linktohtml><link.rule.ids>230,308,778,883,25547,76298</link.rule.ids><linktorsrc>$$Uhttps://worldwide.espacenet.com/publicationDetails/biblio?FT=D&date=20210415&DB=EPODOC&CC=US&NR=2021110831A1$$EView_record_in_European_Patent_Office$$FView_record_in_$$GEuropean_Patent_Office$$Hfree_for_read</linktorsrc></links><search><creatorcontrib>Shillingford, Brendan</creatorcontrib><creatorcontrib>Gomes de Freitas, Joao Ferdinando</creatorcontrib><creatorcontrib>Assael, Ioannis Alexandros</creatorcontrib><title>VISUAL SPEECH RECOGNITION BY PHONEME PREDICTION</title><description>Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for performing visual speech recognition. In one aspect, a method comprises receiving a video comprising a plurality of video frames, wherein each video frame depicts a pair of lips; processing the video using a visual speech recognition neural network to generate, for each output position in an output sequence, a respective output score for each token in a vocabulary of possible tokens, wherein the visual speech recognition neural network comprises one or more volumetric convolutional neural network layers and one or more time-aggregation neural network layers; wherein the vocabulary of possible tokens comprises a plurality of phonemes; and determining a sequence of words expressed by the pair of lips depicted in the video using the output scores.</description><subject>ACOUSTICS</subject><subject>CALCULATING</subject><subject>COMPUTER SYSTEMS BASED ON SPECIFIC COMPUTATIONAL MODELS</subject><subject>COMPUTING</subject><subject>COUNTING</subject><subject>HANDLING RECORD CARRIERS</subject><subject>IMAGE DATA PROCESSING OR GENERATION, IN GENERAL</subject><subject>MUSICAL INSTRUMENTS</subject><subject>PHYSICS</subject><subject>PRESENTATION OF DATA</subject><subject>RECOGNITION OF DATA</subject><subject>RECORD CARRIERS</subject><subject>SPEECH ANALYSIS OR SYNTHESIS</subject><subject>SPEECH OR AUDIO CODING OR DECODING</subject><subject>SPEECH OR VOICE PROCESSING</subject><subject>SPEECH RECOGNITION</subject><fulltext>true</fulltext><rsrctype>patent</rsrctype><creationdate>2021</creationdate><recordtype>patent</recordtype><sourceid>EVB</sourceid><recordid>eNrjZNAP8wwOdfRRCA5wdXX2UAhydfZ39_MM8fT3U3CKVAjw8Pdz9XVVCAhydfF0BonyMLCmJeYUp_JCaW4GZTfXEGcP3dSC_PjU4oLE5NS81JL40GAjAyNDQ0MDC2NDR0Nj4lQBAKrgJvU</recordid><startdate>20210415</startdate><enddate>20210415</enddate><creator>Shillingford, Brendan</creator><creator>Gomes de Freitas, Joao Ferdinando</creator><creator>Assael, Ioannis Alexandros</creator><scope>EVB</scope></search><sort><creationdate>20210415</creationdate><title>VISUAL SPEECH RECOGNITION BY PHONEME PREDICTION</title><author>Shillingford, Brendan ; Gomes de Freitas, Joao Ferdinando ; Assael, Ioannis Alexandros</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-epo_espacenet_US2021110831A13</frbrgroupid><rsrctype>patents</rsrctype><prefilter>patents</prefilter><language>eng</language><creationdate>2021</creationdate><topic>ACOUSTICS</topic><topic>CALCULATING</topic><topic>COMPUTER SYSTEMS BASED ON SPECIFIC COMPUTATIONAL MODELS</topic><topic>COMPUTING</topic><topic>COUNTING</topic><topic>HANDLING RECORD CARRIERS</topic><topic>IMAGE DATA PROCESSING OR GENERATION, IN GENERAL</topic><topic>MUSICAL INSTRUMENTS</topic><topic>PHYSICS</topic><topic>PRESENTATION OF DATA</topic><topic>RECOGNITION OF DATA</topic><topic>RECORD CARRIERS</topic><topic>SPEECH ANALYSIS OR SYNTHESIS</topic><topic>SPEECH OR AUDIO CODING OR DECODING</topic><topic>SPEECH OR VOICE PROCESSING</topic><topic>SPEECH RECOGNITION</topic><toplevel>online_resources</toplevel><creatorcontrib>Shillingford, Brendan</creatorcontrib><creatorcontrib>Gomes de Freitas, Joao Ferdinando</creatorcontrib><creatorcontrib>Assael, Ioannis Alexandros</creatorcontrib><collection>esp@cenet</collection></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext_linktorsrc</fulltext></delivery><addata><au>Shillingford, Brendan</au><au>Gomes de Freitas, Joao Ferdinando</au><au>Assael, Ioannis Alexandros</au><format>patent</format><genre>patent</genre><ristype>GEN</ristype><title>VISUAL SPEECH RECOGNITION BY PHONEME PREDICTION</title><date>2021-04-15</date><risdate>2021</risdate><abstract>Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for performing visual speech recognition. In one aspect, a method comprises receiving a video comprising a plurality of video frames, wherein each video frame depicts a pair of lips; processing the video using a visual speech recognition neural network to generate, for each output position in an output sequence, a respective output score for each token in a vocabulary of possible tokens, wherein the visual speech recognition neural network comprises one or more volumetric convolutional neural network layers and one or more time-aggregation neural network layers; wherein the vocabulary of possible tokens comprises a plurality of phonemes; and determining a sequence of words expressed by the pair of lips depicted in the video using the output scores.</abstract><oa>free_for_read</oa></addata></record>
fulltext	fulltext_linktorsrc
identifier
ispartof
issn
language	eng
recordid	cdi_epo_espacenet_US2021110831A1
source	esp@cenet
subjects	ACOUSTICS CALCULATING COMPUTER SYSTEMS BASED ON SPECIFIC COMPUTATIONAL MODELS COMPUTING COUNTING HANDLING RECORD CARRIERS IMAGE DATA PROCESSING OR GENERATION, IN GENERAL MUSICAL INSTRUMENTS PHYSICS PRESENTATION OF DATA RECOGNITION OF DATA RECORD CARRIERS SPEECH ANALYSIS OR SYNTHESIS SPEECH OR AUDIO CODING OR DECODING SPEECH OR VOICE PROCESSING SPEECH RECOGNITION
title	VISUAL SPEECH RECOGNITION BY PHONEME PREDICTION
url	https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-16T23%3A16%3A37IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-epo_EVB&rft_val_fmt=info:ofi/fmt:kev:mtx:patent&rft.genre=patent&rft.au=Shillingford,%20Brendan&rft.date=2021-04-15&rft_id=info:doi/&rft_dat=%3Cepo_EVB%3EUS2021110831A1%3C/epo_EVB%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_id=info:pmid/&rfr_iscdi=true