Towards Link Characterization From Content: Recovering Distributions From Classifier Output

In processing large volumes of speech and language data, we are often interested in the distribution of languages, speakers, topics, etc. For large data sets, these distributions are typically estimated at a given point in time using pattern classification technology. It is well known that such esti...

Ausführliche Beschreibung

Gespeichert in:
Bibliographische Detailangaben
Veröffentlicht in:IEEE transactions on audio, speech, and language processing speech, and language processing, 2008-05, Vol.16 (4), p.847-858
Hauptverfasser: Grothendieck, John, Gorin, Allen
Format: Artikel
Sprache:eng
Schlagworte:
Online-Zugang:Volltext bestellen
Tags: Tag hinzufügen
Keine Tags, Fügen Sie den ersten Tag hinzu!
Beschreibung
Zusammenfassung:In processing large volumes of speech and language data, we are often interested in the distribution of languages, speakers, topics, etc. For large data sets, these distributions are typically estimated at a given point in time using pattern classification technology. It is well known that such estimates can be highly biased, especially for rare classes. While these biases have been addressed in some applications, they have thus far been ignored in the speech and language literature. This neglect causes significant error for low-frequency classes. Correcting this biased distribution involves exploiting uncertain knowledge of the classifier error patterns. We describe a numerical method, the Metropolis-Hastings (M-H) algorithm, which provides a Bayes estimator for the distribution. We experimentally evaluate this algorithm for a speaker recognition task, demonstrating a fivefold reduction in root mean squared error.
ISSN:1558-7916
2329-9290
1558-7924
2329-9304
DOI:10.1109/TASL.2008.920060