Bayesian nonparametric modeling of hierarchical topics and sentences

Bibliographic details
Main authors: Ying-Lan Chang, Jui-Jung Hung, Jen-Tzung Chien
Format: Conference proceedings
Language: English
Description
Abstract: Automatically scoring the sentences of multiple documents plays an important role in document summarization. This study presents a new Bayesian nonparametric approach to unsupervised learning of a hierarchical topic and sentence model (HTSM). The HTSM discovers an extended hierarchy in the nested Chinese restaurant process (nCRP), in which each sentence is assigned a hierarchical topic path. A tree structure is established whose distributions range from broad topics to precise topics, and the dependencies among sentences are characterized. The words in different sentences are represented by a shared hierarchical Dirichlet process (HDP). The topic mixtures at the word level and the sentence level are estimated by unsupervised nonparametric processes based on the HDP and the nCRP, respectively. Whereas the nCRP represents a document by a single path, the proposed HTSM is more flexible: it uses a new nCRP in which multiple paths generate the different sentences of a document. A summarization system is developed to extract semantically rich sentences from documents, and a new Gibbs sampling algorithm is developed to infer the structural parameters of the HTSM. In experiments on the DUC corpus, the proposed HTSM outperforms the other methods for document summarization in terms of ROUGE measures.
ISSN: 1551-2541, 2378-928X
DOI: 10.1109/MLSP.2011.6064569
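
As a rough illustration of the nCRP idea named in the abstract, the sketch below draws one root-to-leaf tree path per sentence from a standard nested Chinese restaurant process prior, so that sentences can follow different paths through a shared tree. It is not the authors' HTSM or their Gibbs sampler: it samples from the prior only, with no words, no HDP, and no inference. The function names, the fixed tree depth, and the concentration parameter `gamma` are assumptions made for this example.

```python
import random
from collections import defaultdict


def sample_ncrp_path(child_counts, depth, gamma, rng):
    """Draw one path from a nested CRP prior and update the visit counts.

    child_counts maps a node (a tuple of child indices starting at the
    root) to a list of visit counts, one entry per existing child.
    """
    node = ()  # the root is represented by the empty tuple
    for _ in range(depth):
        counts = child_counts[node]
        total = sum(counts) + gamma
        # Existing child k is chosen with probability counts[k] / total;
        # a new child is opened with probability gamma / total.
        u = rng.random() * total
        choice = len(counts)  # default: open a new child
        acc = 0.0
        for k, c in enumerate(counts):
            acc += c
            if u < acc:
                choice = k
                break
        if choice == len(counts):
            counts.append(0)
        counts[choice] += 1
        node = node + (choice,)
    return node


def assign_sentence_paths(num_sentences, depth=3, gamma=1.0, seed=0):
    """Assign each sentence its own path (hypothetical helper, prior only)."""
    rng = random.Random(seed)
    child_counts = defaultdict(list)
    paths = [
        sample_ncrp_path(child_counts, depth, gamma, rng)
        for _ in range(num_sentences)
    ]
    return paths, child_counts


if __name__ == "__main__":
    paths, _ = assign_sentence_paths(num_sentences=8)
    for i, path in enumerate(paths):
        print(f"sentence {i}: path {path}")
```

Larger values of `gamma` make new branches more likely, which yields a wider tree. Paths that share a prefix share their broader topics near the root and differ only in the more specific nodes further down.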