Impact of dataset uncertainties on machine learning model predictions: the example of polymer glass transition temperatures

Bibliographic details
Published in: Modelling and simulation in materials science and engineering, 2019-01, Vol. 27 (2), p. 24002
Main authors: Jha, Anurag; Chandrasekaran, Anand; Kim, Chiho; Ramprasad, Rampi
Format: Article
Language: English
Description
Summary: Over the past decade, there has been a resurgence in the importance of data-driven techniques in materials science and engineering. The utilization of state-of-the-art algorithms, coupled with the increased availability of experimental and computational data, has led to the development of surrogate models offering the promise of rapid and accurate predictions of materials' properties based solely on their structure or composition. Such machine learning (ML) models are trained on available past data and are thus susceptible to the intrinsic uncertainties/errors associated with these past measurements. The glass transition temperature (Tg) of polymers, a property of paramount interest in polymer science, is one strong example of a material property whose reported value can vary widely as a result of a variety of intrinsic and extrinsic factors that occur during the experimental measurement process. In the current work, we curate a large database of Tg measurements from a variety of data sources and proceed to investigate the statistical nature of the inherent uncertainties in the database. Through the partitioning of the dataset using statistically relevant measures, we investigate the effect of variations in the dataset on the performance of the final ML model. We demonstrate that the median, as a measure of central tendency, is a valid approximation when dealing with multiple reported values of Tg for the same polymeric material. Moreover, the Bayesian model noise/uncertainty that emerges from our machine-learning pipeline is able to represent quantitatively the underlying noise/uncertainties in the experimental measurement of Tg.
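The median aggregation mentioned in the summary can be illustrated with a minimal sketch. It is not the authors' pipeline: the record layout, the function name aggregate_by_median, and the polymer labels and Tg values are all hypothetical placeholders chosen only to show how several reported values for one material might be collapsed into a single training target.

```python
from collections import defaultdict
from statistics import median

# Hypothetical (polymer label, reported Tg in kelvin) records.
# The labels and numbers are illustrative placeholders, not data from the paper.
records = [
    ("polymer_A", 373.0),
    ("polymer_A", 378.0),
    ("polymer_A", 350.0),
    ("polymer_B", 250.0),
    ("polymer_B", 254.0),
]

def aggregate_by_median(rows):
    """Collapse multiple reported Tg values per polymer into their median."""
    grouped = defaultdict(list)
    for name, tg in rows:
        grouped[name].append(tg)
    return {name: median(values) for name, values in grouped.items()}

tg_by_polymer = aggregate_by_median(records)
```

The median is less sensitive than the mean to a single outlying report, such as the 350.0 entry for polymer_A in this placeholder data.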
ISSN: 0965-0393, 1361-651X
DOI: 10.1088/1361-651X/aaf8ca