Instruction Multi-Constraint Molecular Generation Using a Teacher-Student Large Language Model
Main authors: | , , , , , , , , , , , , , |
---|---|
Format: | Article |
Language: | eng |
Subjects: | |
Online access: | Order full text |
Summary: | While various models and computational tools have been proposed for structure
and property analysis of molecules, generating molecules that conform to all
desired structures and properties remains a challenge. Here, we introduce a
multi-constraint molecular generation large language model, TSMMG, which, akin
to a student, incorporates knowledge from various small models and tools,
namely, the 'teachers'. To train TSMMG, we construct a large set of
text-molecule pairs by extracting molecular knowledge from these 'teachers',
enabling the model to generate novel molecules that conform to the descriptions
given in various text prompts. We experimentally show that TSMMG performs
remarkably well in generating molecules that meet complex property requirements
described in natural language across two-, three-, and four-constraint tasks,
with an average molecular validity of over 99% and success ratios of 82.58%,
68.03%, and 67.48%, respectively. The model also exhibits adaptability through zero-shot
testing, creating molecules that satisfy combinations of properties that have
not been encountered. It can comprehend text inputs with various language
styles, extending beyond the confines of outlined prompts, as confirmed through
empirical validation. Additionally, the knowledge distillation feature of TSMMG
contributes to the continuous enhancement of small models, while the innovative
approach to dataset construction effectively addresses the issues of data
scarcity and quality, which positions TSMMG as a promising tool in the domains
of drug discovery and materials science. |
---|---|
DOI: | 10.48550/arxiv.2403.13244 |
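
The summary describes two mechanisms that a short sketch may make more concrete: building text-molecule pairs from the outputs of 'teacher' tools, and scoring generated molecules by the fraction that satisfies every requested constraint. The Python below is a minimal illustration under stated assumptions, not the authors' pipeline. The two teacher functions, the prompt wording, and the names `build_pairs` and `success_ratio` are hypothetical, and the substring checks on SMILES strings are crude stand-ins for real property predictors or cheminformatics tools.

```python
from typing import Callable, List, Tuple

# A hypothetical 'teacher' maps a SMILES string to a short text label.
Teacher = Callable[[str], str]


def aromatic_teacher(smiles: str) -> str:
    # Crude stand-in: a lowercase 'c' marks an aromatic carbon in SMILES.
    return "has an aromatic ring" if "c" in smiles else "has no aromatic ring"


def halogen_teacher(smiles: str) -> str:
    # Crude stand-in: substring search for common halogen symbols.
    found = any(symbol in smiles for symbol in ("F", "Cl", "Br", "I"))
    return "has a halogen" if found else "has no halogen"


def build_pairs(smiles_list: List[str], teachers: List[Teacher]) -> List[Tuple[str, str]]:
    """Pair each molecule with a text description assembled from teacher labels."""
    pairs = []
    for smiles in smiles_list:
        description = " and ".join(teacher(smiles) for teacher in teachers)
        pairs.append((f"The molecule {description}.", smiles))
    return pairs


def success_ratio(generated: List[str], required: List[str], teachers: List[Teacher]) -> float:
    """Fraction of generated molecules whose teacher labels match every required label."""
    if not generated:
        return 0.0
    hits = sum(
        all(teacher(smiles) == label for teacher, label in zip(teachers, required))
        for smiles in generated
    )
    return hits / len(generated)


teachers = [aromatic_teacher, halogen_teacher]
pairs = build_pairs(["c1ccccc1Cl", "CCO"], teachers)
ratio = success_ratio(
    ["c1ccccc1Cl", "CCO", "c1ccccc1F", "c1ccccc1"],
    ["has an aromatic ring", "has a halogen"],
    teachers,
)
```

In this toy setting a molecule counts as a success only when all constraints hold at once, which is one plausible reading of why the reported success ratios fall as the number of constraints grows from two to four.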