Exploiting redundancy in large materials datasets for efficient machine learning with less data
Published in: | Nature Communications 2023-11, Vol.14 (1), p.7283-10, Article 7283 |
---|---|
Main authors: | , , , , , |
Format: | Article |
Language: | English |
Abstract: | Extensive efforts to gather materials data have largely overlooked potential data redundancy. In this study, we present evidence of a significant degree of redundancy across multiple large datasets for various material properties, by revealing that up to 95% of data can be safely removed from machine learning training with little impact on in-distribution prediction performance. The redundant data is related to over-represented material types and does not mitigate the severe performance degradation on out-of-distribution samples. In addition, we show that uncertainty-based active learning algorithms can construct much smaller but equally informative datasets. We discuss the effectiveness of informative data in improving prediction performance and robustness and provide insights into efficient data acquisition and machine learning training. This work challenges the “bigger is better” mentality and calls for attention to the information richness of materials data rather than a narrow emphasis on data volume.
Big data is crucial for machine learning, but the redundancies in the datasets are rarely studied. Here the authors reveal significant redundancy in large materials datasets, showing that up to 95% of data can be removed without impacting prediction accuracy. |
---|---|
ISSN: | 2041-1723 |
DOI: | 10.1038/s41467-023-42992-y |
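
The abstract mentions that uncertainty-based active learning can build much smaller but equally informative training sets. The following is a minimal sketch of that general idea, not the authors' implementation: it assumes a bootstrap ensemble of ridge regressors as the uncertainty estimator and uses synthetic data, and all names (`fit_ridge`, `ensemble_predict`, `active_learning_select`) are hypothetical helpers introduced for illustration.

```python
import numpy as np

def fit_ridge(X, y, alpha=1e-3):
    """Closed-form ridge regression; returns the weight vector."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(d), X.T @ y)

def ensemble_predict(X_train, y_train, X_query, n_models=10, rng=None):
    """Mean and std of predictions from a bootstrap ensemble (assumed uncertainty proxy)."""
    rng = np.random.default_rng() if rng is None else rng
    preds = []
    for _ in range(n_models):
        # Resample the training set with replacement and refit.
        idx = rng.integers(0, len(X_train), size=len(X_train))
        preds.append(X_query @ fit_ridge(X_train[idx], y_train[idx]))
    preds = np.array(preds)
    return preds.mean(axis=0), preds.std(axis=0)

def active_learning_select(X_pool, y_pool, n_init=10, batch=10, budget=50, seed=0):
    """Grow a training subset by repeatedly adding the most uncertain pool samples."""
    rng = np.random.default_rng(seed)
    selected = list(rng.choice(len(X_pool), size=n_init, replace=False))
    while len(selected) < budget:
        remaining = np.setdiff1d(np.arange(len(X_pool)), selected)
        _, std = ensemble_predict(
            X_pool[selected], y_pool[selected], X_pool[remaining], rng=rng
        )
        take = min(batch, budget - len(selected))
        # Pick the samples where the ensemble members disagree the most.
        selected.extend(remaining[np.argsort(std)[-take:]].tolist())
    return np.array(selected)

# Synthetic stand-in for a materials dataset (purely illustrative).
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 0.1 * rng.normal(size=1000)
X_test = rng.normal(size=(200, 5))
y_test = X_test @ np.array([1.0, -2.0, 0.5, 0.0, 3.0])

subset = active_learning_select(X, y, budget=50)  # 5% of the pool

def rmse(w):
    return float(np.sqrt(np.mean((X_test @ w - y_test) ** 2)))

rmse_full = rmse(fit_ridge(X, y))
rmse_subset = rmse(fit_ridge(X[subset], y[subset]))
print(f"full pool RMSE: {rmse_full:.4f}, 5% subset RMSE: {rmse_subset:.4f}")
```

Comparing the two errors on a held-out set gives a rough sense of how much of the pool is redundant for in-distribution prediction; on real materials data the model, the uncertainty estimate, and the batch size would all need to be chosen for the property at hand.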