The Goldilocks paradigm: comparing classical machine learning, large language models, and few-shot learning for drug discovery applications

Bibliographic Details
Published in: Communications Chemistry 2024-06, Vol. 7 (1), p. 134-11
Authors: Snyder, Scott H., Vignaux, Patricia A., Ozalp, Mustafa Kemal, Gerlach, Jacob, Puhl, Ana C., Lane, Thomas R., Corbett, John, Urbina, Fabio, Ekins, Sean
Format: Article
Language: English
Subjects:
Online access: Full text
Description
Summary: Recent advances in machine learning (ML) have produced newer model architectures, including transformers (large language models, LLMs), which show state-of-the-art results in text generation and image analysis, and few-shot learning (FSLC) models, which offer predictive power from extremely small datasets. These new architectures may offer promise, yet the 'no free lunch' theorem suggests that no single algorithm can outperform all others at every possible task. Here, we explore the capabilities of classical (SVR), FSLC, and transformer (MolBART) models over a range of dataset tasks and show a 'goldilocks zone' for each model type, in which dataset size and feature distribution (i.e. dataset "diversity") determine the optimal algorithm strategy. When datasets are small […]
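The selection strategy the abstract outlines can be read as a simple decision rule over dataset size and diversity. Below is a minimal, illustrative Python sketch of such a rule: the numeric cutoffs, the mean-pairwise-distance diversity measure, and the exact mapping of zones to model families are assumptions for illustration, not values or results taken from the paper.

```python
# Illustrative sketch of the "goldilocks zone" model-selection idea from the
# abstract. The cutoffs (50 molecules, diversity threshold), the diversity
# measure, and the zone-to-model mapping are hypothetical placeholders.
import numpy as np
from scipy.spatial.distance import pdist
from sklearn.svm import SVR

def dataset_diversity(X: np.ndarray) -> float:
    """Stand-in 'diversity' score: mean pairwise Euclidean distance
    between descriptor vectors."""
    return float(pdist(X).mean()) if len(X) > 1 else 0.0

def choose_model_family(X: np.ndarray,
                        small_cutoff: int = 50,         # hypothetical
                        diversity_cutoff: float = 5.0,  # hypothetical
                        ) -> str:
    """Map dataset size/diversity to a model family, following the
    strategy the abstract describes in outline."""
    if len(X) < small_cutoff:
        return "few-shot learner (FSLC)"
    if dataset_diversity(X) > diversity_cutoff:
        return "transformer (e.g. MolBART)"
    return "classical ML (e.g. SVR)"

# Example with synthetic stand-in molecular descriptors:
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 16))
y = rng.normal(size=40)
print(choose_model_family(X))          # 40 < 50 -> few-shot zone

# The classical branch would then be an ordinary scikit-learn SVR baseline:
baseline = SVR(kernel="rbf").fit(X, y)
```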
ISSN: 2399-3669
DOI: 10.1038/s42004-024-01220-4