Text representation model of scientific papers based on fusing multi-viewpoint information and its quality assessment

Text representation is the preliminary work for in-depth analysis and mining of information in scientific papers. It directly affects the effects of downstream tasks such as, scientific papers classification, clustering, and similarity calculation. However, recent researches mainly considered citati...

Ausführliche Beschreibung

Gespeichert in:
Bibliographische Detailangaben
Veröffentlicht in:Scientometrics 2021-08, Vol.126 (8), p.6937-6963
Hauptverfasser: Lu, Yonghe, Luo, Jiayi, Xiao, Ying, Zhu, Hou
Format: Artikel
Sprache:eng
Schlagworte:
Online-Zugang:Volltext
Tags: Tag hinzufügen
Keine Tags, Fügen Sie den ersten Tag hinzu!
Beschreibung
Zusammenfassung:Text representation is the preliminary work for in-depth analysis and mining of information in scientific papers. It directly affects the effects of downstream tasks such as, scientific papers classification, clustering, and similarity calculation. However, recent researches mainly considered citation network and partial structural information, which is insufficient when representing scientific papers. Therefore, in order to improve the performance of text representation model, this paper proposed MV-HATrans, a text representation model that combines multi-viewpoint information, such as the semantic information of knowledge graph and structural information. This model extracts word information from three aspects, including contextual content, part of speech, and word meaning of WordNet. Based on combination of hierarchical attention mechanism and transformer, the model achieves the full text representation of scientific papers. Finally, this paper uses the binary experimental dataset AAPR, which indicates whether scientific papers are accepted or not, and applies the proposed model of text representation to achieve the goal of automatic quality assessment. Results show that in the quality classification of scientific papers, adopting part-of-speech information and semantic information based on WordNet definitions can effectively achieve the accuracy of prediction as 70.14%. Among all the structural modules, authors and abstracts contributes the most to the quality classification of scientific papers, especially authors as 9.51%.
ISSN:0138-9130
1588-2861
DOI:10.1007/s11192-021-04028-4