Direct and latent modeling techniques for computing spoken document similarity

Document similarity measures are required for a variety of data organization and retrieval tasks including document clustering, document link detection, and query-by-example document retrieval. In this paper we examine existing and novel document similarity measures for use with spoken document coll...

Ausführliche Beschreibung

Gespeichert in:
Bibliographische Detailangaben
1. Verfasser: Hazen, T J
Format: Tagungsbericht
Sprache:eng
Schlagworte:
Online-Zugang:Volltext bestellen
Tags: Tag hinzufügen
Keine Tags, Fügen Sie den ersten Tag hinzu!
Beschreibung
Zusammenfassung:Document similarity measures are required for a variety of data organization and retrieval tasks including document clustering, document link detection, and query-by-example document retrieval. In this paper we examine existing and novel document similarity measures for use with spoken document collections processed with automatic speech recognition (ASR) technology. We compare direct vector space approaches using the cosine similarity measure applied to feature vectors constructed with various forms of term frequency inverse document frequency (TF-IDF) normalization against latent topic modeling approaches based on latent Dirichlet allocation (LDA). In document link detection experiments on the Fisher Corpus, we find that an approach that applies bagging to models derived from LDA substantially outperforms the direct vector space approach.
DOI:10.1109/SLT.2010.5700880