Named entity recognition in the perovskite field based on convolutional neural networks and MatBERT


Bibliographic details
Published in: Computational materials science, 2024-05, Vol. 240, p. 113014, Article 113014
Main authors: Zhang, Jiaxin; Zhang, Lingxue; Sun, Yuxuan; Li, Wei; Quhe, Ruge
Format: Article
Language: English
Description
Summary:
• A public perovskite labeling dataset is provided.
• The MatBERT-CNN-CRF model was constructed.
• The F1 score of the new model has improved to 90.8%.

The sharp increase in materials science publications has created a bottleneck in organizing materials science knowledge and discovering new materials. The literature in the emerging field of perovskite materials has grown to a massive scale, so it is necessary to compile information on the structure, properties, synthesis methods, characterization techniques, and applications of these materials. To address this issue, we employed named entity recognition, a natural language processing technique, to extract important entities from perovskite material texts. In this paper, we propose a method based on convolutional neural networks (CNN) and MatBERT. First, we used MatBERT, which has been pre-trained on a large amount of materials science text, to generate contextualized word embeddings. Next, we extracted feature information using a CNN model. Finally, a conditional random field (CRF) layer was used to decode the tag sequences and to calculate the training and validation loss. Experimental results demonstrated that the performance of our model on the perovskite material dataset improved by 1% to 6% compared with the BERT, SciBERT, and MatBERT models. Using this model, we extracted the entities from 2389 abstracts to obtain knowledge of perovskite materials.
ISSN: 0927-0256, 1879-0801
DOI: 10.1016/j.commatsci.2024.113014
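
The summary describes a CRF layer that decodes tag sequences on top of CNN features. As a rough illustration of that decoding step only, the sketch below runs Viterbi decoding over hand-written toy scores and then groups the resulting BIO tags into entities. Everything in it is an assumption made for illustration: the tag names (`MAT`, `APL`), the emission numbers, the transition penalties, and the helper names `viterbi_decode` and `bio_to_entities` do not come from the paper, whose scores would instead be produced by the trained MatBERT and CNN layers.

```python
from typing import Dict, List, Sequence, Tuple

# Hypothetical tag set in BIO format (not the paper's label scheme).
TAGS = ["O", "B-MAT", "I-MAT", "B-APL", "I-APL"]

# Toy transition scores: penalise I-X unless the previous tag is B-X or I-X.
# A trained CRF would learn these values; missing pairs count as 0.0.
TRANSITIONS: Dict[Tuple[str, str], float] = {
    (prev, cur): -10.0
    for prev in TAGS
    for cur in TAGS
    if cur.startswith("I-") and prev[2:] != cur[2:]
}

TOKENS = ["MAPbI3", "solar", "cells"]

# Toy per-token tag scores standing in for the CNN output over embeddings.
EMISSIONS = [
    {"O": 0.3, "B-MAT": 2.0, "I-MAT": 0.5, "B-APL": 0.1, "I-APL": 0.0},
    {"O": 0.2, "B-MAT": 0.1, "I-MAT": 0.3, "B-APL": 1.5, "I-APL": 0.2},
    {"O": 0.5, "B-MAT": 0.0, "I-MAT": 1.2, "B-APL": 0.1, "I-APL": 1.0},
]


def viterbi_decode(
    emissions: Sequence[Dict[str, float]],
    transitions: Dict[Tuple[str, str], float],
    tags: Sequence[str],
) -> List[str]:
    """Return the highest-scoring tag path for one sentence."""
    if not emissions:
        return []
    # best[t] is the best score of any path ending in tag t at the current token.
    best = {t: emissions[0][t] for t in tags}
    backpointers: List[Dict[str, str]] = []
    for scores in emissions[1:]:
        new_best: Dict[str, float] = {}
        pointers: Dict[str, str] = {}
        for cur in tags:
            prev = max(tags, key=lambda p: best[p] + transitions.get((p, cur), 0.0))
            new_best[cur] = best[prev] + transitions.get((prev, cur), 0.0) + scores[cur]
            pointers[cur] = prev
        best = new_best
        backpointers.append(pointers)
    # Walk the backpointers from the best final tag to recover the path.
    path = [max(tags, key=lambda t: best[t])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


def bio_to_entities(tokens: Sequence[str], tags: Sequence[str]) -> List[Tuple[str, str]]:
    """Group BIO tags into (entity text, label) pairs."""
    entities: List[Tuple[str, str]] = []
    current: List[str] = []
    label = ""
    for token, tag in zip(tokens, tags):
        if tag.startswith("I-") and current and tag[2:] == label:
            current.append(token)  # continue the open entity
            continue
        if current:
            entities.append((" ".join(current), label))  # close the open entity
        if tag.startswith("B-"):
            current, label = [token], tag[2:]
        else:
            current, label = [], ""
    if current:
        entities.append((" ".join(current), label))
    return entities


if __name__ == "__main__":
    decoded = viterbi_decode(EMISSIONS, TRANSITIONS, TAGS)
    print(decoded)
    print(bio_to_entities(TOKENS, decoded))
```

In this toy input the third token's emission score slightly favours `I-MAT`, but the transition penalty for `B-APL` followed by `I-MAT` steers the decoder to `I-APL`. That is the kind of sequence-level consistency a CRF layer adds over independent per-token predictions.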