Structured information extraction from scientific text with large language models

Detailed description

Bibliographic details
Published in:	Nature communications 2024-02, Vol.15 (1), p.1418-1418, Article 1418
Main authors:	Dagdelen, John; Dunn, Alexander; Lee, Sanghoon; Walker, Nicholas; Rosen, Andrew S.; Ceder, Gerbrand; Persson, Kristin A.; Jain, Anubhav
Format:	Article
Language:	English
Subjects:	639/301; 639/301/1034; 639/705/1046; 706/648/697/129; Accessibility; Chatbots; Chemistry; Humanities and Social Sciences; Information retrieval; Knowledge; KNOWLEDGE MANAGEMENT AND PRESERVATION; Language; Large language models; Machine learning; Materials science; Metal-organic frameworks; Morphology; multidisciplinary; Nanoparticles; Natural language processing; Science; Science (multidisciplinary); Semantics; Sentences; Task complexity; Thin films; Zinc oxides
Online access:	Full text

Description
Summary:	Extracting structured knowledge from scientific text remains a challenging task for machine learning models. Here, we present a simple approach to joint named entity recognition and relation extraction and demonstrate how pretrained large language models (GPT-3, Llama-2) can be fine-tuned to extract useful records of complex scientific knowledge. We test three representative tasks in materials chemistry: linking dopants and host materials, cataloging metal-organic frameworks, and general composition/phase/morphology/application information extraction. Records are extracted from single sentences or entire paragraphs, and the output can be returned as simple English sentences or a more structured format such as a list of JSON objects. This approach represents a simple, accessible, and highly flexible route to obtaining large databases of structured specialized scientific knowledge extracted from research papers. Extracting scientific data from published research is a complex task requiring specialised tools. Here the authors present a scheme based on large language models to automatise the retrieval of information from text in a flexible and accessible manner.
ISSN:	2041-1723
DOI:	10.1038/s41467-024-45563-x
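
Illustration: the summary mentions that extracted records can be returned as a list of JSON objects, for example when linking dopants to host materials. The sketch below shows one way such output might be consumed downstream. It is not the authors' code: the field names `host` and `dopants`, the helper names, and the example completion string are all hypothetical and were made up for this illustration.

```python
import json


def parse_records(completion: str) -> list[dict]:
    """Parse a model completion that is expected to hold a JSON list of objects.

    Returns an empty list when the text is not valid JSON or is not a list,
    since model output is not guaranteed to be well formed.
    """
    try:
        data = json.loads(completion)
    except json.JSONDecodeError:
        return []
    if not isinstance(data, list):
        return []
    # Keep only JSON objects; ignore stray strings or numbers in the list.
    return [item for item in data if isinstance(item, dict)]


def dopant_host_pairs(records: list[dict]) -> list[tuple[str, str]]:
    """Flatten records into (dopant, host) pairs.

    Assumes the hypothetical schema {"host": str, "dopants": [str, ...]}.
    """
    pairs = []
    for rec in records:
        host = rec.get("host")
        dopants = rec.get("dopants", [])
        if not isinstance(host, str) or not isinstance(dopants, list):
            continue  # skip records that do not match the assumed schema
        for dopant in dopants:
            pairs.append((dopant, host))
    return pairs


# Invented example completion, not taken from the paper.
example_completion = '[{"host": "ZnO", "dopants": ["Al", "Ga"]}]'
records = parse_records(example_completion)
pairs = dopant_host_pairs(records)
print(pairs)
```

Returning an empty list on malformed output, rather than raising, is a design choice for batch processing of many passages; a stricter pipeline might log or re-query instead.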