UNSUPERVISED DATA SELECTION VIA DISCRETE SPEECH REPRESENTATION FOR AUTOMATIC SPEECH RECOGNITION

Bibliographic details
Main authors:	HAN, Wei; CHEN, Zhehuai; ZHANG, Yu; HAGHANI, Parisa; WANG, Yongqiang; LU, Zhiyun
Format:	Patent
Language:	English; French
Subjects:	ACOUSTICS; MUSICAL INSTRUMENTS; PHYSICS; SPEECH ANALYSIS OR SYNTHESIS; SPEECH OR AUDIO CODING OR DECODING; SPEECH OR VOICE PROCESSING; SPEECH RECOGNITION

Description
Summary:	A method (500) includes obtaining a corpus of unlabeled training data (358) that includes a plurality of spoken utterances (360), each spoken utterance including audio data (362) characterizing that utterance. The method also includes receiving a target domain (324) and selecting, using a contrastive data selection model (310), a subset of the utterances from the corpus of unlabeled training data that correspond to the target domain. The method further includes training an automatic speech recognition (ASR) model (200) on the subset of utterances.
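The record does not say how the contrastive data selection model (310) scores utterances. As a rough illustration only, the Python sketch below assumes each utterance has already been converted to a sequence of discrete token IDs (as the title suggests), and ranks utterances by a length-normalized log-likelihood ratio between a target-domain model and a general model. The add-one smoothed unigram models, the function names, and the toy data are all hypothetical stand-ins, not the method claimed in the patent.

```python
import math
from collections import Counter


def train_unigram(sequences, vocab_size):
    """Add-one smoothed unigram log-probabilities over discrete token IDs.

    Assumption: a unigram model stands in for whatever model the patent uses.
    """
    counts = Counter(tok for seq in sequences for tok in seq)
    total = sum(counts.values()) + vocab_size
    return [math.log((counts[t] + 1) / total) for t in range(vocab_size)]


def contrastive_score(tokens, target_lp, general_lp):
    """Mean per-token log-likelihood ratio: target model minus general model."""
    if not tokens:
        return float("-inf")  # an empty utterance is never selected
    return sum(target_lp[t] - general_lp[t] for t in tokens) / len(tokens)


def select_subset(corpus, target_lp, general_lp, k):
    """Return the IDs of the k utterances that look most like the target domain."""
    ranked = sorted(
        corpus,
        key=lambda item: contrastive_score(item[1], target_lp, general_lp),
        reverse=True,
    )
    return [utt_id for utt_id, _ in ranked[:k]]


# Hypothetical toy data: (utterance ID, discrete token IDs).
VOCAB_SIZE = 4
corpus = [
    ("utt_a", [0, 0, 1, 0]),
    ("utt_b", [2, 3, 2, 3]),
    ("utt_c", [0, 1, 1, 0]),
    ("utt_d", [3, 3, 2, 2]),
]
target_examples = [[0, 1, 0, 0], [1, 0, 1, 0]]  # small target-domain sample

general_lp = train_unigram([tokens for _, tokens in corpus], VOCAB_SIZE)
target_lp = train_unigram(target_examples, VOCAB_SIZE)
selected = select_subset(corpus, target_lp, general_lp, k=2)
```

Here `selected` holds the IDs of the utterances whose tokens are relatively more likely under the target-domain model than under the general one; in the method summarized above, the audio for those utterances would then be used to train the ASR model (200).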