Learning entity-centric document representations using an entity facet topic model

• We propose the task of entity-centric document representation learning.
• We propose a novel Entity Facet Topic Model (EFTM) to learn entity-centric document representations.
• We confirm our hypothesis regarding the existence of multiple facets of an entity by analysing the learned entity facets using...

Detailed description

Bibliographic details
Published in:	Information processing & management 2020-05, Vol.57 (3), p.102216, Article 102216
Main authors:	Wu, Chuan; Kanoulas, Evangelos; Rijke, Maarten de
Format:	Article
Language:	eng
Subjects:	Algorithms; Automatic text analysis; Classification; Document representation; Entity aspects; Experimentation; Information management; Information processing; Information retrieval; Learning; Machine learning; Questions; Representations; Text classification; Topic models
Online access:	Full text

container_end_page
container_issue	3
container_start_page	102216
container_title	Information processing & management
container_volume	57
creator	Wu, Chuan; Kanoulas, Evangelos; Rijke, Maarten de
description	• We propose the task of entity-centric document representation learning.
• We propose a novel Entity Facet Topic Model (EFTM) to learn entity-centric document representations.
• We confirm our hypothesis regarding the existence of multiple facets of an entity by analysing the learned entity facets using qualitative and quantitative analysis, and identify an effective number of facets per entity.
• We demonstrate the effectiveness of EFTM in downstream applications using a multilabel classification task.
Learning semantic representations of documents is essential for various downstream applications, including text classification and information retrieval. Entities, as important sources of information, have played a crucial role in assisting latent representations of documents. In this work, we hypothesize that entities are not monolithic concepts; instead they have multiple aspects, and different documents may discuss different aspects of a given entity. Given that, we argue that from an entity-centric point of view, a document related to multiple entities should (a) be represented differently for different entities (multiple entity-centric representations), and (b) have each entity-centric representation reflect the specific aspects of the entity discussed in the document. We pose the following research questions: (1) can we confirm that entities have multiple aspects, with different aspects reflected in different documents; (2) can we learn a representation of entity aspects from a collection of documents, and a representation of a document based on the multiple entities and their aspects as reflected in the document; (3) does this novel representation improve algorithm performance in downstream applications; and (4) what is a reasonable number of aspects per entity?
To answer these questions we model each entity using multiple aspects (entity facets; to avoid unnecessary ambiguity, we use "facet" in place of both "aspect" and "facet" throughout this work), where each entity facet is represented as a mixture of latent topics. Then, given a document associated with multiple entities, we assume multiple entity-centric representations, where each entity-centric representation is a mixture of entity facets for each entity. Finally, a novel graphical model, the Entity Facet Topic Model (EFTM), is proposed in order to learn entity-centric document representations, entity facets, and latent topics.
Through experimentation we confirm that (1) entities are multi-faceted concepts which we can model and learn, (2) a multi-faceted entity-centric modeling of documents can lead to effective representations, which (3) can have an impact in downstream applications, and (4) considering a small number of facets is effective enough. In particular, we visualize entity facets within a set of documents, and demonstrate that different sets of documents indeed reflect different facets of entities. Further, we demonstrate that the proposed entity facet topic model generates better document representations in terms of perplexity, compared to state-of-the-art document representation methods. Moreover, we show that the proposed model outperforms baseline methods in the application of multi-label classification. Finally, we study the impact of EFTM's parameters and find that a small number of facets better captures entity-specific topics, which confirms the intuition that on average an entity has a small number of facets reflected in documents.
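The layered mixture that the description outlines (topics over words, entity facets over topics, and an entity-centric document representation over facets) can be illustrated with a small sketch. This is not the EFTM inference procedure, which this record does not give; the vocabulary, the probabilities, and the helper `mix` are all hypothetical and chosen only to show how the three levels compose.

```python
# Illustrative sketch only: every name and number below is hypothetical,
# not taken from the EFTM paper.

def mix(weights, components):
    """Return the weighted sum of equally long probability vectors."""
    size = len(components[0])
    return [sum(w * comp[i] for w, comp in zip(weights, components))
            for i in range(size)]

# Two latent topics, each a distribution over a three-word vocabulary.
vocabulary = ["match", "election", "stadium"]
topics = [
    [0.5, 0.0, 0.5],   # a "sports"-like topic
    [0.0, 1.0, 0.0],   # a "politics"-like topic
]

# One entity with two facets; each facet is a mixture of the latent topics.
entity_facets = [
    [1.0, 0.0],   # facet 0 draws only on the first topic
    [0.5, 0.5],   # facet 1 blends both topics
]

# Entity-centric representation of one document: a mixture of that entity's facets.
doc_facet_weights = [0.5, 0.5]

# Collapse facets into a document-specific topic mixture, then into word probabilities.
doc_topic_weights = mix(doc_facet_weights, entity_facets)
doc_word_probs = mix(doc_topic_weights, topics)

for word, prob in zip(vocabulary, doc_word_probs):
    print(word, prob)
```

A document linked to several entities would carry one such `doc_facet_weights` vector per entity, which is what makes the representation entity-centric in the sense the description uses.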
doi_str_mv	10.1016/j.ipm.2020.102216
format	Article
fullrecord	<record><control><sourceid>proquest_cross</sourceid><recordid>TN_cdi_proquest_journals_2442321144</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><els_id>S0306457318308008</els_id><sourcerecordid>2442321144</sourcerecordid><originalsourceid>FETCH-LOGICAL-c368t-f0cf7062d4713b580dc80d382544f324c3d2a0e1f729e3556e80ec6da4de44c43</originalsourceid><addsrcrecordid>eNp9kE1LxDAQhoMouK7-AG8Fz13z3YonWfyCBUH0HGIykZRtU5NU2H9vSvfsYXhn4H1mhheha4I3BBN522382G8opvNMKZEnaEXahtWCNeQUrTDDsuaiYefoIqUOY8wFoSv0vgMdBz98VzBknw-1KRq9qWwwU1_6KsIYIZVOZx-GVE1pduvhCFROG8hVDmOB-mBhf4nOnN4nuDrqGn0-PX5sX-rd2_Pr9mFXGybbXDtsXIMltbwh7Eu02JpSrKWCc8coN8xSjYG4ht4BE0JCi8FIq7kFzg1na3Sz7B1j-JkgZdWFKQ7lpKKcU0YJ4bOLLC4TQ0oRnBqj73U8KILVHJ3qVIlOzdGpJbrC3C8MlPd_PUSVjIfBgPURTFY2-H_oP_vydtE</addsrcrecordid><sourcetype>Aggregation Database</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>2442321144</pqid></control><display><type>article</type><title>Learning entity-centric document representations using an entity facet topic model</title><source>Access via ScienceDirect (Elsevier)</source><creator>Wu, Chuan ; Kanoulas, Evangelos ; Rijke, Maarten de</creator><creatorcontrib>Wu, Chuan ; Kanoulas, Evangelos ; Rijke, Maarten de</creatorcontrib><description>•We propose the task of entity-centric document representation learning.•We propose a novel Entity Facet Topic Model (EFTM) to learn entity-centric document representations.•We confirm our hypothesis regarding the existence of multiple facets of an entity by analysing the learned entity facets using qualitative and quantitative analysis, and identify a effective number of facets per entity.•We demonstrate the effectiveness of EFTM in downstream applications using a multilabel classification task. Learning semantic representations of documents is essential for various downstream applications, including text classification and information retrieval. 
Entities, as important sources of information, have been playing a crucial role in assisting latent representations of documents. In this work, we hypothesize that entities are not monolithic concepts; instead they have multiple aspects, and different documents may be discussing different aspects of a given entity. Given that, we argue that from an entity-centric point of view, a document related to multiple entities shall be (a) represented differently for different entities (multiple entity-centric representations), and (b) each entity-centric representation should reflect the specific aspects of the entity discussed in the document. In this work, we devise the following research questions: (1) Can we confirm that entities have multiple aspects, with different aspects reflected in different documents, (2) can we learn a representation of entity aspects from a collection of documents, and a representation of document based on the multiple entities and their aspects as reflected in the documents, (3) does this novel representation improves algorithm performance in downstream applications, and (4) what is a reasonable number of aspects per entity? To answer these questions we model each entity using multiple aspects (entity facets11To avoid unnecessary ambiguity, we use facet instead of both aspect and facet across this work.), where each entity facet is represented as a mixture of latent topics. Then, given a document associated with multiple entities, we assume multiple entity-centric representations, where each entity-centric representation is a mixture of entity facets for each entity. Finally, a novel graphical model, the Entity Facet Topic Model (EFTM), is proposed in order to learn entity-centric document representations, entity facets, and latent topics. 
Through experimentation we confirm that (1) entities are multi-faceted concepts which we can model and learn, (2) a multi-faceted entity-centric modeling of documents can lead to effective representations, which (3) can have an impact in downstream application, and (4) considering a small number of facets is effective enough. In particular, we visualize entity facets within a set of documents, and demonstrate that indeed different sets of documents reflect different facets of entities. Further, we demonstrate that the proposed entity facet topic model generates better document representations in terms of perplexity, compared to state-of-the-art document representation methods. Moreover, we show that the proposed model outperforms baseline methods in the application of multi-label classification. Finally, we study the impact of EFTM’s parameters and find that a small number of facets better captures entity specific topics, which confirms the intuition that on average an entity has a small number of facets reflected in documents.</description><identifier>ISSN: 0306-4573</identifier><identifier>EISSN: 1873-5371</identifier><identifier>DOI: 10.1016/j.ipm.2020.102216</identifier><language>eng</language><publisher>Oxford: Elsevier Ltd</publisher><subject>Algorithms ; Automatic text analysis ; Classification ; Document representation ; Entity aspects ; Experimentation ; Information management ; Information processing ; Information retrieval ; Learning ; Machine learning ; Questions ; Representations ; Text classification ; Topic models</subject><ispartof>Information processing & management, 2020-05, Vol.57 (3), p.102216, Article 102216</ispartof><rights>2020 Elsevier Ltd</rights><rights>Copyright Pergamon Press Inc. 
May 2020</rights><lds50>peer_reviewed</lds50><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed><citedby>FETCH-LOGICAL-c368t-f0cf7062d4713b580dc80d382544f324c3d2a0e1f729e3556e80ec6da4de44c43</citedby><cites>FETCH-LOGICAL-c368t-f0cf7062d4713b580dc80d382544f324c3d2a0e1f729e3556e80ec6da4de44c43</cites><orcidid>0000-0002-1086-0202</orcidid></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><linktohtml>$$Uhttps://dx.doi.org/10.1016/j.ipm.2020.102216$$EHTML$$P50$$Gelsevier$$H</linktohtml><link.rule.ids>314,780,784,3550,27924,27925,45995</link.rule.ids></links><search><creatorcontrib>Wu, Chuan</creatorcontrib><creatorcontrib>Kanoulas, Evangelos</creatorcontrib><creatorcontrib>Rijke, Maarten de</creatorcontrib><title>Learning entity-centric document representations using an entity facet topic model</title><title>Information processing & management</title><description>•We propose the task of entity-centric document representation learning.•We propose a novel Entity Facet Topic Model (EFTM) to learn entity-centric document representations.•We confirm our hypothesis regarding the existence of multiple facets of an entity by analysing the learned entity facets using qualitative and quantitative analysis, and identify a effective number of facets per entity.•We demonstrate the effectiveness of EFTM in downstream applications using a multilabel classification task. Learning semantic representations of documents is essential for various downstream applications, including text classification and information retrieval. Entities, as important sources of information, have been playing a crucial role in assisting latent representations of documents. In this work, we hypothesize that entities are not monolithic concepts; instead they have multiple aspects, and different documents may be discussing different aspects of a given entity. 
Given that, we argue that from an entity-centric point of view, a document related to multiple entities shall be (a) represented differently for different entities (multiple entity-centric representations), and (b) each entity-centric representation should reflect the specific aspects of the entity discussed in the document. In this work, we devise the following research questions: (1) Can we confirm that entities have multiple aspects, with different aspects reflected in different documents, (2) can we learn a representation of entity aspects from a collection of documents, and a representation of document based on the multiple entities and their aspects as reflected in the documents, (3) does this novel representation improves algorithm performance in downstream applications, and (4) what is a reasonable number of aspects per entity? To answer these questions we model each entity using multiple aspects (entity facets11To avoid unnecessary ambiguity, we use facet instead of both aspect and facet across this work.), where each entity facet is represented as a mixture of latent topics. Then, given a document associated with multiple entities, we assume multiple entity-centric representations, where each entity-centric representation is a mixture of entity facets for each entity. Finally, a novel graphical model, the Entity Facet Topic Model (EFTM), is proposed in order to learn entity-centric document representations, entity facets, and latent topics. Through experimentation we confirm that (1) entities are multi-faceted concepts which we can model and learn, (2) a multi-faceted entity-centric modeling of documents can lead to effective representations, which (3) can have an impact in downstream application, and (4) considering a small number of facets is effective enough. In particular, we visualize entity facets within a set of documents, and demonstrate that indeed different sets of documents reflect different facets of entities. 
Further, we demonstrate that the proposed entity facet topic model generates better document representations in terms of perplexity, compared to state-of-the-art document representation methods. Moreover, we show that the proposed model outperforms baseline methods in the application of multi-label classification. Finally, we study the impact of EFTM’s parameters and find that a small number of facets better captures entity specific topics, which confirms the intuition that on average an entity has a small number of facets reflected in documents.</description><subject>Algorithms</subject><subject>Automatic text analysis</subject><subject>Classification</subject><subject>Document representation</subject><subject>Entity aspects</subject><subject>Experimentation</subject><subject>Information management</subject><subject>Information processing</subject><subject>Information retrieval</subject><subject>Learning</subject><subject>Machine learning</subject><subject>Questions</subject><subject>Representations</subject><subject>Text classification</subject><subject>Topic models</subject><issn>0306-4573</issn><issn>1873-5371</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2020</creationdate><recordtype>article</recordtype><recordid>eNp9kE1LxDAQhoMouK7-AG8Fz13z3YonWfyCBUH0HGIykZRtU5NU2H9vSvfsYXhn4H1mhheha4I3BBN522382G8opvNMKZEnaEXahtWCNeQUrTDDsuaiYefoIqUOY8wFoSv0vgMdBz98VzBknw-1KRq9qWwwU1_6KsIYIZVOZx-GVE1pduvhCFROG8hVDmOB-mBhf4nOnN4nuDrqGn0-PX5sX-rd2_Pr9mFXGybbXDtsXIMltbwh7Eu02JpSrKWCc8coN8xSjYG4ht4BE0JCi8FIq7kFzg1na3Sz7B1j-JkgZdWFKQ7lpKKcU0YJ4bOLLC4TQ0oRnBqj73U8KILVHJ3qVIlOzdGpJbrC3C8MlPd_PUSVjIfBgPURTFY2-H_oP_vydtE</recordid><startdate>202005</startdate><enddate>202005</enddate><creator>Wu, Chuan</creator><creator>Kanoulas, Evangelos</creator><creator>Rijke, Maarten de</creator><general>Elsevier Ltd</general><general>Elsevier Science 
Ltd</general><scope>AAYXX</scope><scope>CITATION</scope><scope>E3H</scope><scope>F2A</scope><orcidid>https://orcid.org/0000-0002-1086-0202</orcidid></search><sort><creationdate>202005</creationdate><title>Learning entity-centric document representations using an entity facet topic model</title><author>Wu, Chuan ; Kanoulas, Evangelos ; Rijke, Maarten de</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-c368t-f0cf7062d4713b580dc80d382544f324c3d2a0e1f729e3556e80ec6da4de44c43</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2020</creationdate><topic>Algorithms</topic><topic>Automatic text analysis</topic><topic>Classification</topic><topic>Document representation</topic><topic>Entity aspects</topic><topic>Experimentation</topic><topic>Information management</topic><topic>Information processing</topic><topic>Information retrieval</topic><topic>Learning</topic><topic>Machine learning</topic><topic>Questions</topic><topic>Representations</topic><topic>Text classification</topic><topic>Topic models</topic><toplevel>peer_reviewed</toplevel><toplevel>online_resources</toplevel><creatorcontrib>Wu, Chuan</creatorcontrib><creatorcontrib>Kanoulas, Evangelos</creatorcontrib><creatorcontrib>Rijke, Maarten de</creatorcontrib><collection>CrossRef</collection><collection>Library & Information Sciences Abstracts (LISA)</collection><collection>Library & Information Science Abstracts (LISA)</collection><jtitle>Information processing & management</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Wu, Chuan</au><au>Kanoulas, Evangelos</au><au>Rijke, Maarten de</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>Learning entity-centric document representations using an entity facet topic model</atitle><jtitle>Information processing & 
management</jtitle><date>2020-05</date><risdate>2020</risdate><volume>57</volume><issue>3</issue><spage>102216</spage><pages>102216-</pages><artnum>102216</artnum><issn>0306-4573</issn><eissn>1873-5371</eissn><abstract>•We propose the task of entity-centric document representation learning.•We propose a novel Entity Facet Topic Model (EFTM) to learn entity-centric document representations.•We confirm our hypothesis regarding the existence of multiple facets of an entity by analysing the learned entity facets using qualitative and quantitative analysis, and identify a effective number of facets per entity.•We demonstrate the effectiveness of EFTM in downstream applications using a multilabel classification task. Learning semantic representations of documents is essential for various downstream applications, including text classification and information retrieval. Entities, as important sources of information, have been playing a crucial role in assisting latent representations of documents. In this work, we hypothesize that entities are not monolithic concepts; instead they have multiple aspects, and different documents may be discussing different aspects of a given entity. Given that, we argue that from an entity-centric point of view, a document related to multiple entities shall be (a) represented differently for different entities (multiple entity-centric representations), and (b) each entity-centric representation should reflect the specific aspects of the entity discussed in the document. 
In this work, we devise the following research questions: (1) Can we confirm that entities have multiple aspects, with different aspects reflected in different documents, (2) can we learn a representation of entity aspects from a collection of documents, and a representation of document based on the multiple entities and their aspects as reflected in the documents, (3) does this novel representation improves algorithm performance in downstream applications, and (4) what is a reasonable number of aspects per entity? To answer these questions we model each entity using multiple aspects (entity facets11To avoid unnecessary ambiguity, we use facet instead of both aspect and facet across this work.), where each entity facet is represented as a mixture of latent topics. Then, given a document associated with multiple entities, we assume multiple entity-centric representations, where each entity-centric representation is a mixture of entity facets for each entity. Finally, a novel graphical model, the Entity Facet Topic Model (EFTM), is proposed in order to learn entity-centric document representations, entity facets, and latent topics. Through experimentation we confirm that (1) entities are multi-faceted concepts which we can model and learn, (2) a multi-faceted entity-centric modeling of documents can lead to effective representations, which (3) can have an impact in downstream application, and (4) considering a small number of facets is effective enough. In particular, we visualize entity facets within a set of documents, and demonstrate that indeed different sets of documents reflect different facets of entities. Further, we demonstrate that the proposed entity facet topic model generates better document representations in terms of perplexity, compared to state-of-the-art document representation methods. Moreover, we show that the proposed model outperforms baseline methods in the application of multi-label classification. 
Finally, we study the impact of EFTM’s parameters and find that a small number of facets better captures entity specific topics, which confirms the intuition that on average an entity has a small number of facets reflected in documents.</abstract><cop>Oxford</cop><pub>Elsevier Ltd</pub><doi>10.1016/j.ipm.2020.102216</doi><orcidid>https://orcid.org/0000-0002-1086-0202</orcidid><oa>free_for_read</oa></addata></record>
fulltext	fulltext
identifier	ISSN: 0306-4573
ispartof	Information processing & management, 2020-05, Vol.57 (3), p.102216, Article 102216
issn	0306-4573 1873-5371
language	eng
recordid	cdi_proquest_journals_2442321144
source	Access via ScienceDirect (Elsevier)
subjects	Algorithms; Automatic text analysis; Classification; Document representation; Entity aspects; Experimentation; Information management; Information processing; Information retrieval; Learning; Machine learning; Questions; Representations; Text classification; Topic models
title	Learning entity-centric document representations using an entity facet topic model
url	https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2024-12-29T20%3A57%3A32IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_cross&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Learning%20entity-centric%20document%20representations%20using%20an%20entity%20facet%20topic%20model&rft.jtitle=Information%20processing%20&%20management&rft.au=Wu,%20Chuan&rft.date=2020-05&rft.volume=57&rft.issue=3&rft.spage=102216&rft.pages=102216-&rft.artnum=102216&rft.issn=0306-4573&rft.eissn=1873-5371&rft_id=info:doi/10.1016/j.ipm.2020.102216&rft_dat=%3Cproquest_cross%3E2442321144%3C/proquest_cross%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2442321144&rft_id=info:pmid/&rft_els_id=S0306457318308008&rfr_iscdi=true