Learning entity-centric document representations using an entity facet topic model
• We propose the task of entity-centric document representation learning. • We propose a novel Entity Facet Topic Model (EFTM) to learn entity-centric document representations. • We confirm our hypothesis regarding the existence of multiple facets of an entity by analysing the learned entity facets using...
Published in: | Information processing & management 2020-05, Vol.57 (3), p.102216, Article 102216 |
---|---|
Main authors: | Wu, Chuan ; Kanoulas, Evangelos ; Rijke, Maarten de |
Format: | Article |
Language: | eng |
Subjects: | |
Online access: | Full text |
container_end_page | |
---|---|
container_issue | 3 |
container_start_page | 102216 |
container_title | Information processing & management |
container_volume | 57 |
creator | Wu, Chuan ; Kanoulas, Evangelos ; Rijke, Maarten de |
description | • We propose the task of entity-centric document representation learning. • We propose a novel Entity Facet Topic Model (EFTM) to learn entity-centric document representations. • We confirm our hypothesis regarding the existence of multiple facets of an entity by analysing the learned entity facets using qualitative and quantitative analysis, and identify an effective number of facets per entity. • We demonstrate the effectiveness of EFTM in downstream applications using a multi-label classification task.
Learning semantic representations of documents is essential for various downstream applications, including text classification and information retrieval. Entities, as important sources of information, have been playing a crucial role in assisting latent representations of documents. In this work, we hypothesize that entities are not monolithic concepts; instead they have multiple aspects, and different documents may be discussing different aspects of a given entity. Given that, we argue that from an entity-centric point of view, a document related to multiple entities shall be (a) represented differently for different entities (multiple entity-centric representations), and (b) each entity-centric representation should reflect the specific aspects of the entity discussed in the document.
In this work, we pose the following research questions: (1) can we confirm that entities have multiple aspects, with different aspects reflected in different documents; (2) can we learn a representation of entity aspects from a collection of documents, and a representation of a document based on the multiple entities and their aspects as reflected in the documents; (3) does this novel representation improve algorithm performance in downstream applications; and (4) what is a reasonable number of aspects per entity? To answer these questions we model each entity using multiple aspects (entity facets; to avoid unnecessary ambiguity, we use "facet" instead of both "aspect" and "facet" throughout this work), where each entity facet is represented as a mixture of latent topics. Then, given a document associated with multiple entities, we assume multiple entity-centric representations, where each entity-centric representation is a mixture of entity facets for each entity. Finally, a novel graphical model, the Entity Facet Topic Model (EFTM), is proposed in order to learn entity-centric document representations, entity facets, and latent topics.
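Through experimentation we confirm that (1) entities are multi-faceted concepts which we can model and learn, (2) a multi-faceted entity-centric modeling of documents can lead to effective representations, which (3) can have an impact in downstream applications, and (4) considering a small number of facets is effective enough. In particular, we visualize entity facets within a set of documents, and demonstrate that indeed different sets of documents reflect different facets of entities. Further, we demonstrate that the proposed entity facet topic model generates better document representations in terms of perplexity, compared to state-of-the-art document representation methods. Moreover, we show that the proposed model outperforms baseline methods in the application of multi-label classification. Finally, we study the impact of EFTM’s parameters and find that a small number of facets better captures entity-specific topics, which confirms the intuition that on average an entity has a small number of facets reflected in documents. |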
doi_str_mv | 10.1016/j.ipm.2020.102216 |
format | Article |
fullrecord | <record><control><sourceid>proquest_cross</sourceid><recordid>TN_cdi_proquest_journals_2442321144</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><els_id>S0306457318308008</els_id><sourcerecordid>2442321144</sourcerecordid><originalsourceid>FETCH-LOGICAL-c368t-f0cf7062d4713b580dc80d382544f324c3d2a0e1f729e3556e80ec6da4de44c43</originalsourceid><addsrcrecordid>eNp9kE1LxDAQhoMouK7-AG8Fz13z3YonWfyCBUH0HGIykZRtU5NU2H9vSvfsYXhn4H1mhheha4I3BBN522382G8opvNMKZEnaEXahtWCNeQUrTDDsuaiYefoIqUOY8wFoSv0vgMdBz98VzBknw-1KRq9qWwwU1_6KsIYIZVOZx-GVE1pduvhCFROG8hVDmOB-mBhf4nOnN4nuDrqGn0-PX5sX-rd2_Pr9mFXGybbXDtsXIMltbwh7Eu02JpSrKWCc8coN8xSjYG4ht4BE0JCi8FIq7kFzg1na3Sz7B1j-JkgZdWFKQ7lpKKcU0YJ4bOLLC4TQ0oRnBqj73U8KILVHJ3qVIlOzdGpJbrC3C8MlPd_PUSVjIfBgPURTFY2-H_oP_vydtE</addsrcrecordid><sourcetype>Aggregation Database</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>2442321144</pqid></control><display><type>article</type><title>Learning entity-centric document representations using an entity facet topic model</title><source>Access via ScienceDirect (Elsevier)</source><creator>Wu, Chuan ; Kanoulas, Evangelos ; Rijke, Maarten de</creator><creatorcontrib>Wu, Chuan ; Kanoulas, Evangelos ; Rijke, Maarten de</creatorcontrib><description>•We propose the task of entity-centric document representation learning.•We propose a novel Entity Facet Topic Model (EFTM) to learn entity-centric document representations.•We confirm our hypothesis regarding the existence of multiple facets of an entity by analysing the learned entity facets using qualitative and quantitative analysis, and identify a effective number of facets per entity.•We demonstrate the effectiveness of EFTM in downstream applications using a multilabel classification task.
Learning semantic representations of documents is essential for various downstream applications, including text classification and information retrieval. Entities, as important sources of information, have been playing a crucial role in assisting latent representations of documents. In this work, we hypothesize that entities are not monolithic concepts; instead they have multiple aspects, and different documents may be discussing different aspects of a given entity. Given that, we argue that from an entity-centric point of view, a document related to multiple entities shall be (a) represented differently for different entities (multiple entity-centric representations), and (b) each entity-centric representation should reflect the specific aspects of the entity discussed in the document.
In this work, we pose the following research questions: (1) can we confirm that entities have multiple aspects, with different aspects reflected in different documents; (2) can we learn a representation of entity aspects from a collection of documents, and a representation of a document based on the multiple entities and their aspects as reflected in the documents; (3) does this novel representation improve algorithm performance in downstream applications; and (4) what is a reasonable number of aspects per entity? To answer these questions we model each entity using multiple aspects (entity facets; to avoid unnecessary ambiguity, we use "facet" instead of both "aspect" and "facet" throughout this work), where each entity facet is represented as a mixture of latent topics. Then, given a document associated with multiple entities, we assume multiple entity-centric representations, where each entity-centric representation is a mixture of entity facets for each entity. Finally, a novel graphical model, the Entity Facet Topic Model (EFTM), is proposed in order to learn entity-centric document representations, entity facets, and latent topics.
Through experimentation we confirm that (1) entities are multi-faceted concepts which we can model and learn, (2) a multi-faceted entity-centric modeling of documents can lead to effective representations, which (3) can have an impact in downstream application, and (4) considering a small number of facets is effective enough. In particular, we visualize entity facets within a set of documents, and demonstrate that indeed different sets of documents reflect different facets of entities. Further, we demonstrate that the proposed entity facet topic model generates better document representations in terms of perplexity, compared to state-of-the-art document representation methods. Moreover, we show that the proposed model outperforms baseline methods in the application of multi-label classification. Finally, we study the impact of EFTM’s parameters and find that a small number of facets better captures entity specific topics, which confirms the intuition that on average an entity has a small number of facets reflected in documents.</description><identifier>ISSN: 0306-4573</identifier><identifier>EISSN: 1873-5371</identifier><identifier>DOI: 10.1016/j.ipm.2020.102216</identifier><language>eng</language><publisher>Oxford: Elsevier Ltd</publisher><subject>Algorithms ; Automatic text analysis ; Classification ; Document representation ; Entity aspects ; Experimentation ; Information management ; Information processing ; Information retrieval ; Learning ; Machine learning ; Questions ; Representations ; Text classification ; Topic models</subject><ispartof>Information processing & management, 2020-05, Vol.57 (3), p.102216, Article 102216</ispartof><rights>2020 Elsevier Ltd</rights><rights>Copyright Pergamon Press Inc. 
May 2020</rights><lds50>peer_reviewed</lds50><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed><citedby>FETCH-LOGICAL-c368t-f0cf7062d4713b580dc80d382544f324c3d2a0e1f729e3556e80ec6da4de44c43</citedby><cites>FETCH-LOGICAL-c368t-f0cf7062d4713b580dc80d382544f324c3d2a0e1f729e3556e80ec6da4de44c43</cites><orcidid>0000-0002-1086-0202</orcidid></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><linktohtml>$$Uhttps://dx.doi.org/10.1016/j.ipm.2020.102216$$EHTML$$P50$$Gelsevier$$H</linktohtml><link.rule.ids>314,780,784,3550,27924,27925,45995</link.rule.ids></links><search><creatorcontrib>Wu, Chuan</creatorcontrib><creatorcontrib>Kanoulas, Evangelos</creatorcontrib><creatorcontrib>Rijke, Maarten de</creatorcontrib><title>Learning entity-centric document representations using an entity facet topic model</title><title>Information processing & management</title><description>•We propose the task of entity-centric document representation learning.•We propose a novel Entity Facet Topic Model (EFTM) to learn entity-centric document representations.•We confirm our hypothesis regarding the existence of multiple facets of an entity by analysing the learned entity facets using qualitative and quantitative analysis, and identify a effective number of facets per entity.•We demonstrate the effectiveness of EFTM in downstream applications using a multilabel classification task.
Learning semantic representations of documents is essential for various downstream applications, including text classification and information retrieval. Entities, as important sources of information, have been playing a crucial role in assisting latent representations of documents. In this work, we hypothesize that entities are not monolithic concepts; instead they have multiple aspects, and different documents may be discussing different aspects of a given entity. Given that, we argue that from an entity-centric point of view, a document related to multiple entities shall be (a) represented differently for different entities (multiple entity-centric representations), and (b) each entity-centric representation should reflect the specific aspects of the entity discussed in the document.
In this work, we pose the following research questions: (1) can we confirm that entities have multiple aspects, with different aspects reflected in different documents; (2) can we learn a representation of entity aspects from a collection of documents, and a representation of a document based on the multiple entities and their aspects as reflected in the documents; (3) does this novel representation improve algorithm performance in downstream applications; and (4) what is a reasonable number of aspects per entity? To answer these questions we model each entity using multiple aspects (entity facets; to avoid unnecessary ambiguity, we use "facet" instead of both "aspect" and "facet" throughout this work), where each entity facet is represented as a mixture of latent topics. Then, given a document associated with multiple entities, we assume multiple entity-centric representations, where each entity-centric representation is a mixture of entity facets for each entity. Finally, a novel graphical model, the Entity Facet Topic Model (EFTM), is proposed in order to learn entity-centric document representations, entity facets, and latent topics.
Through experimentation we confirm that (1) entities are multi-faceted concepts which we can model and learn, (2) a multi-faceted entity-centric modeling of documents can lead to effective representations, which (3) can have an impact in downstream application, and (4) considering a small number of facets is effective enough. In particular, we visualize entity facets within a set of documents, and demonstrate that indeed different sets of documents reflect different facets of entities. Further, we demonstrate that the proposed entity facet topic model generates better document representations in terms of perplexity, compared to state-of-the-art document representation methods. Moreover, we show that the proposed model outperforms baseline methods in the application of multi-label classification. Finally, we study the impact of EFTM’s parameters and find that a small number of facets better captures entity specific topics, which confirms the intuition that on average an entity has a small number of facets reflected in documents.</description><subject>Algorithms</subject><subject>Automatic text analysis</subject><subject>Classification</subject><subject>Document representation</subject><subject>Entity aspects</subject><subject>Experimentation</subject><subject>Information management</subject><subject>Information processing</subject><subject>Information retrieval</subject><subject>Learning</subject><subject>Machine learning</subject><subject>Questions</subject><subject>Representations</subject><subject>Text classification</subject><subject>Topic 
models</subject><issn>0306-4573</issn><issn>1873-5371</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2020</creationdate><recordtype>article</recordtype><recordid>eNp9kE1LxDAQhoMouK7-AG8Fz13z3YonWfyCBUH0HGIykZRtU5NU2H9vSvfsYXhn4H1mhheha4I3BBN522382G8opvNMKZEnaEXahtWCNeQUrTDDsuaiYefoIqUOY8wFoSv0vgMdBz98VzBknw-1KRq9qWwwU1_6KsIYIZVOZx-GVE1pduvhCFROG8hVDmOB-mBhf4nOnN4nuDrqGn0-PX5sX-rd2_Pr9mFXGybbXDtsXIMltbwh7Eu02JpSrKWCc8coN8xSjYG4ht4BE0JCi8FIq7kFzg1na3Sz7B1j-JkgZdWFKQ7lpKKcU0YJ4bOLLC4TQ0oRnBqj73U8KILVHJ3qVIlOzdGpJbrC3C8MlPd_PUSVjIfBgPURTFY2-H_oP_vydtE</recordid><startdate>202005</startdate><enddate>202005</enddate><creator>Wu, Chuan</creator><creator>Kanoulas, Evangelos</creator><creator>Rijke, Maarten de</creator><general>Elsevier Ltd</general><general>Elsevier Science Ltd</general><scope>AAYXX</scope><scope>CITATION</scope><scope>E3H</scope><scope>F2A</scope><orcidid>https://orcid.org/0000-0002-1086-0202</orcidid></search><sort><creationdate>202005</creationdate><title>Learning entity-centric document representations using an entity facet topic model</title><author>Wu, Chuan ; Kanoulas, Evangelos ; Rijke, Maarten de</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-c368t-f0cf7062d4713b580dc80d382544f324c3d2a0e1f729e3556e80ec6da4de44c43</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2020</creationdate><topic>Algorithms</topic><topic>Automatic text analysis</topic><topic>Classification</topic><topic>Document representation</topic><topic>Entity aspects</topic><topic>Experimentation</topic><topic>Information management</topic><topic>Information processing</topic><topic>Information retrieval</topic><topic>Learning</topic><topic>Machine learning</topic><topic>Questions</topic><topic>Representations</topic><topic>Text classification</topic><topic>Topic models</topic><toplevel>peer_reviewed</toplevel><toplevel>online_resources</toplevel><creatorcontrib>Wu, 
Chuan</creatorcontrib><creatorcontrib>Kanoulas, Evangelos</creatorcontrib><creatorcontrib>Rijke, Maarten de</creatorcontrib><collection>CrossRef</collection><collection>Library & Information Sciences Abstracts (LISA)</collection><collection>Library & Information Science Abstracts (LISA)</collection><jtitle>Information processing & management</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Wu, Chuan</au><au>Kanoulas, Evangelos</au><au>Rijke, Maarten de</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>Learning entity-centric document representations using an entity facet topic model</atitle><jtitle>Information processing & management</jtitle><date>2020-05</date><risdate>2020</risdate><volume>57</volume><issue>3</issue><spage>102216</spage><pages>102216-</pages><artnum>102216</artnum><issn>0306-4573</issn><eissn>1873-5371</eissn><abstract>•We propose the task of entity-centric document representation learning.•We propose a novel Entity Facet Topic Model (EFTM) to learn entity-centric document representations.•We confirm our hypothesis regarding the existence of multiple facets of an entity by analysing the learned entity facets using qualitative and quantitative analysis, and identify a effective number of facets per entity.•We demonstrate the effectiveness of EFTM in downstream applications using a multilabel classification task.
Learning semantic representations of documents is essential for various downstream applications, including text classification and information retrieval. Entities, as important sources of information, have been playing a crucial role in assisting latent representations of documents. In this work, we hypothesize that entities are not monolithic concepts; instead they have multiple aspects, and different documents may be discussing different aspects of a given entity. Given that, we argue that from an entity-centric point of view, a document related to multiple entities shall be (a) represented differently for different entities (multiple entity-centric representations), and (b) each entity-centric representation should reflect the specific aspects of the entity discussed in the document.
In this work, we pose the following research questions: (1) can we confirm that entities have multiple aspects, with different aspects reflected in different documents; (2) can we learn a representation of entity aspects from a collection of documents, and a representation of a document based on the multiple entities and their aspects as reflected in the documents; (3) does this novel representation improve algorithm performance in downstream applications; and (4) what is a reasonable number of aspects per entity? To answer these questions we model each entity using multiple aspects (entity facets; to avoid unnecessary ambiguity, we use "facet" instead of both "aspect" and "facet" throughout this work), where each entity facet is represented as a mixture of latent topics. Then, given a document associated with multiple entities, we assume multiple entity-centric representations, where each entity-centric representation is a mixture of entity facets for each entity. Finally, a novel graphical model, the Entity Facet Topic Model (EFTM), is proposed in order to learn entity-centric document representations, entity facets, and latent topics.
Through experimentation we confirm that (1) entities are multi-faceted concepts which we can model and learn, (2) a multi-faceted entity-centric modeling of documents can lead to effective representations, which (3) can have an impact in downstream application, and (4) considering a small number of facets is effective enough. In particular, we visualize entity facets within a set of documents, and demonstrate that indeed different sets of documents reflect different facets of entities. Further, we demonstrate that the proposed entity facet topic model generates better document representations in terms of perplexity, compared to state-of-the-art document representation methods. Moreover, we show that the proposed model outperforms baseline methods in the application of multi-label classification. Finally, we study the impact of EFTM’s parameters and find that a small number of facets better captures entity specific topics, which confirms the intuition that on average an entity has a small number of facets reflected in documents.</abstract><cop>Oxford</cop><pub>Elsevier Ltd</pub><doi>10.1016/j.ipm.2020.102216</doi><orcidid>https://orcid.org/0000-0002-1086-0202</orcidid><oa>free_for_read</oa></addata></record> |
fulltext | fulltext |
identifier | ISSN: 0306-4573 |
ispartof | Information processing & management, 2020-05, Vol.57 (3), p.102216, Article 102216 |
issn | 0306-4573 ; 1873-5371 |
language | eng |
recordid | cdi_proquest_journals_2442321144 |
source | Access via ScienceDirect (Elsevier) |
subjects | Algorithms ; Automatic text analysis ; Classification ; Document representation ; Entity aspects ; Experimentation ; Information management ; Information processing ; Information retrieval ; Learning ; Machine learning ; Questions ; Representations ; Text classification ; Topic models |
title | Learning entity-centric document representations using an entity facet topic model |
url | https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2024-12-29T20%3A57%3A32IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_cross&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Learning%20entity-centric%20document%20representations%20using%20an%20entity%20facet%20topic%20model&rft.jtitle=Information%20processing%20&%20management&rft.au=Wu,%20Chuan&rft.date=2020-05&rft.volume=57&rft.issue=3&rft.spage=102216&rft.pages=102216-&rft.artnum=102216&rft.issn=0306-4573&rft.eissn=1873-5371&rft_id=info:doi/10.1016/j.ipm.2020.102216&rft_dat=%3Cproquest_cross%3E2442321144%3C/proquest_cross%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2442321144&rft_id=info:pmid/&rft_els_id=S0306457318308008&rfr_iscdi=true |
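
The description above characterizes each entity facet as a mixture of latent topics and each entity-centric document representation as a mixture of that entity's facets, and it evaluates representations by perplexity. The following is a minimal sketch of that two-level mixture on toy numbers. It is not the authors' model or inference procedure, which this record does not describe; the names `mix`, `perplexity`, `topic_word`, `facet_topic`, and `doc_entity_facet`, and all values, are assumptions chosen for illustration.

```python
import math


def mix(weights, components):
    """Return the weighted average of equally sized probability vectors."""
    size = len(components[0])
    return [
        sum(w * comp[i] for w, comp in zip(weights, components))
        for i in range(size)
    ]


def perplexity(word_probs, word_ids):
    """Perplexity of a token sequence under a fixed word distribution."""
    log_likelihood = sum(math.log(word_probs[w]) for w in word_ids)
    return math.exp(-log_likelihood / len(word_ids))


# Hypothetical toy setup: a vocabulary of 4 words and 2 latent topics.
# Each row is a topic's distribution over words and sums to 1.
topic_word = [
    [0.7, 0.1, 0.1, 0.1],
    [0.1, 0.1, 0.1, 0.7],
]

# One entity with 2 facets; each facet is a mixture over the 2 topics.
facet_topic = [
    [0.9, 0.1],
    [0.2, 0.8],
]

# Entity-centric representation of one document: a mixture over the facets.
doc_entity_facet = [0.75, 0.25]

# Collapse the hierarchy: facets -> topics, then topics -> words.
doc_entity_topic = mix(doc_entity_facet, facet_topic)
doc_entity_word = mix(doc_entity_topic, topic_word)

# Score a short token sequence (word ids) under the resulting distribution.
tokens = [0, 0, 3]
print(doc_entity_topic, doc_entity_word, perplexity(doc_entity_word, tokens))
```

In this sketch a different document mentioning the same entity would only change `doc_entity_facet`, while `facet_topic` and `topic_word` stay shared, which is one way to read the claim that different documents reflect different facets of an entity. Lower perplexity on held-out tokens indicates a better fit.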