SPEAKING CLASSIFICATION USING AUDIO-VISUAL DATA

Bibliographic details
Main authors:	ROTH, Joseph Edward; KLEJCH, Ondrej; CHAUDHURI, Sourish
Format:	Patent
Language:	English; French
Subjects:	ACOUSTICS; MUSICAL INSTRUMENTS; PHYSICS; SPEECH ANALYSIS OR SYNTHESIS; SPEECH OR AUDIO CODING OR DECODING; SPEECH OR VOICE PROCESSING; SPEECH RECOGNITION

Description
Summary:	Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for generating predictions for whether a target person is speaking during a portion of a video. In one aspect, a method includes obtaining one or more images which each depict a mouth of a given person at a respective time point. The images are processed using an image embedding neural network to generate a latent representation of the images. Audio data corresponding to the images is processed using an audio embedding neural network to generate a latent representation of the audio data. The latent representation of the images and the latent representation of the audio data are processed using a recurrent neural network to generate a prediction for whether the given person is speaking.
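
The data flow named in the abstract (image embedding, audio embedding, recurrent fusion, speaking prediction) can be illustrated with a minimal sketch in plain Python. Everything concrete here is an assumption made for illustration and is not taken from the patent: the class name SpeakingClassifier, the single-layer tanh embeddings, the Elman-style recurrent update, the layer sizes, the random untrained weights, and the choice to emit one probability per time point.

```python
import math
import random


def make_matrix(rows, cols, rng):
    """Random weights standing in for trained parameters (assumption)."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def embed(weights, features):
    """Toy embedding network: one linear layer followed by tanh."""
    return [math.tanh(a) for a in matvec(weights, features)]


class SpeakingClassifier:
    """Hypothetical two-branch embedding model with a simple recurrent head."""

    def __init__(self, image_dim, audio_dim, embed_dim=4, hidden_dim=6, seed=0):
        rng = random.Random(seed)
        self.w_image = make_matrix(embed_dim, image_dim, rng)   # image branch
        self.w_audio = make_matrix(embed_dim, audio_dim, rng)   # audio branch
        self.w_in = make_matrix(hidden_dim, 2 * embed_dim, rng)  # fused input
        self.w_rec = make_matrix(hidden_dim, hidden_dim, rng)    # recurrence
        self.w_out = [rng.uniform(-0.5, 0.5) for _ in range(hidden_dim)]
        self.hidden_dim = hidden_dim

    def predict(self, mouth_frames, audio_frames):
        """Return one speaking probability in (0, 1) per time point."""
        if len(mouth_frames) != len(audio_frames):
            raise ValueError("need one audio feature vector per mouth image")
        hidden = [0.0] * self.hidden_dim
        probabilities = []
        for image, audio in zip(mouth_frames, audio_frames):
            # Latent representations of the image and of the matching audio,
            # concatenated into one fused input vector.
            fused = embed(self.w_image, image) + embed(self.w_audio, audio)
            # Recurrent update: new state depends on fused input and old state.
            pre = [a + b for a, b in zip(matvec(self.w_in, fused),
                                         matvec(self.w_rec, hidden))]
            hidden = [math.tanh(p) for p in pre]
            # Sigmoid readout gives the speaking probability for this step.
            logit = sum(w * h for w, h in zip(self.w_out, hidden))
            probabilities.append(1.0 / (1.0 + math.exp(-logit)))
        return probabilities


if __name__ == "__main__":
    # Made-up feature vectors: 3 values per mouth crop, 2 per audio window.
    model = SpeakingClassifier(image_dim=3, audio_dim=2)
    frames = [[0.1, 0.2, 0.3], [0.0, 0.5, 0.1]]
    audio = [[0.3, 0.1], [0.2, 0.4]]
    print(model.predict(frames, audio))
```

Because the weights are random, the printed probabilities carry no meaning; the sketch only shows how the two latent representations are combined and passed through a recurrent state before a prediction is produced.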