Learning entity-centric document representations using an entity facet topic model


Detailed description

Bibliographic details
Published in: Information processing & management 2020-05, Vol. 57 (3), p. 102216, Article 102216
Main authors: Wu, Chuan; Kanoulas, Evangelos; Rijke, Maarten de
Format: Article
Language: English
Online access: Full text
Description
Summary:
• We propose the task of entity-centric document representation learning.
• We propose a novel Entity Facet Topic Model (EFTM) to learn entity-centric document representations.
• We confirm our hypothesis regarding the existence of multiple facets of an entity by analysing the learned entity facets using qualitative and quantitative analysis, and identify an effective number of facets per entity.
• We demonstrate the effectiveness of EFTM in downstream applications using a multilabel classification task.

Learning semantic representations of documents is essential for various downstream applications, including text classification and information retrieval. Entities, as important sources of information, play a crucial role in supporting latent representations of documents. In this work, we hypothesize that entities are not monolithic concepts; instead, they have multiple aspects, and different documents may discuss different aspects of a given entity. Given that, we argue that, from an entity-centric point of view, a document related to multiple entities should (a) be represented differently for different entities (multiple entity-centric representations), and (b) have each entity-centric representation reflect the specific aspects of the entity discussed in the document. We pose the following research questions: (1) Can we confirm that entities have multiple aspects, with different aspects reflected in different documents? (2) Can we learn a representation of entity aspects from a collection of documents, and a representation of a document based on the multiple entities and their aspects as reflected in the document? (3) Does this novel representation improve algorithm performance in downstream applications? (4) What is a reasonable number of aspects per entity?
To answer these questions, we model each entity using multiple aspects (entity facets; to avoid unnecessary ambiguity, we use "facet" instead of both "aspect" and "facet" throughout this work), where each entity facet is represented as a mixture of latent topics. Then, given a document associated with multiple entities, we assume multiple entity-centric representations, where each entity-centric representation is a mixture of entity facets for the corresponding entity. Finally, we propose a novel graphical model, the Entity Facet Topic Model (EFTM), to learn entity-centric document representations, entity facets, and latent topics. Through experimentation we confirm that (1) entities
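
The two-level mixture described in the summary can be illustrated with a small sketch. It is not the EFTM inference procedure and is not taken from the article: the entity names, the number of topics and facets, and all probabilities are made-up assumptions, and projecting a facet mixture into topic space by a weighted sum is one plausible reading of "a mixture of entity facets", where each facet is itself "a mixture of latent topics".

```python
# Illustrative sketch only: all names and numbers below are hypothetical.

def mix(weights, components):
    """Return the weighted sum of equal-length probability vectors."""
    length = len(components[0])
    return [sum(w * c[k] for w, c in zip(weights, components)) for k in range(length)]

# Assumed: each entity has 2 facets, each facet is a distribution over 3 latent topics.
facet_topics = {
    "entity_a": [[0.8, 0.1, 0.1], [0.1, 0.7, 0.2]],
    "entity_b": [[0.2, 0.2, 0.6], [0.5, 0.4, 0.1]],
}

# Assumed: one document linked to both entities, with one facet mixture per entity.
doc_facet_weights = {"entity_a": [0.25, 0.75], "entity_b": [1.0, 0.0]}

# One entity-centric representation per entity, expressed in topic space.
entity_centric = {
    entity: mix(weights, facet_topics[entity])
    for entity, weights in doc_facet_weights.items()
}

for entity, representation in entity_centric.items():
    print(entity, [round(p, 3) for p in representation])
```

The same document ends up with a different vector for each entity, which is the property the summary asks of an entity-centric representation; in the actual model these distributions would be learned rather than fixed by hand.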
ISSN: 0306-4573, 1873-5371
DOI: 10.1016/j.ipm.2020.102216