Data augmentation and machine learning techniques for control strategy development in bio-polymerization process

Bibliographic details
Published in:	Environmental science and ecotechnology 2022-07, Vol.11, p.100172-100172, Article 100172
Main authors:	Wei, Sizhou; Chen, Zhiyuan; Arumugasamy, Senthil Kumar; Chew, Irene Mei Leng
Format:	Article
Language:	English
Subjects:	Artificial neural network; Bio-polymerization; Original Research; Random forest; Variational autoencoder; Generative adversarial network
Online access:	Full text
Description
Summary:	Machine learning is increasingly used in biochemistry. However, in organic chemistry and other experiment-based fields, data collected from real experiments are often inadequate, and the current coronavirus disease (COVID-19) pandemic has made the situation even worse. Such limited data can lead to poorly performing models and hinder the proper development of a control strategy. This paper proposes a feasible machine learning solution to the problem of small sample size in the bio-polymerization process. To avoid overfitting, the variational autoencoder (VAE) and generative adversarial network (GAN) algorithms are used for data augmentation. The random forest (RF) and artificial neural network (ANN) algorithms are implemented in the modeling process. The results show that data augmentation effectively improves the performance of the regression model. Of the machine learning models compared, the random forest model with data augmentation by the generative adversarial network achieved the best performance in predicting the molecular weight, with a coefficient of determination (R²) of 0.94 on the training set and 0.74 on the test set.
• Data augmentation was applied to a bio-polymerization process with an insufficient dataset.
• VAE and GAN algorithms were implemented for data augmentation to avoid overfitting.
• The RF model outperformed the ANN model for control strategy development.
• The RF model with GAN augmentation achieved the best performance, with an R² of 0.94 on the training set.
• The developed model can be used as a benchmark for other reaction systems.
ISSN:	2666-4984, 2096-9643
DOI:	10.1016/j.ese.2022.100172
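
The summary reports model quality as a coefficient of determination on the training and test sets. As a minimal sketch of how that metric is commonly computed (this is not the authors' code; the function name `r_squared` and the toy values are hypothetical and do not come from the paper):

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_y = sum(y_true) / len(y_true)
    # Residual sum of squares: error left over after the model's predictions.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares: spread of the observations around their mean.
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when y_true is constant")
    return 1.0 - ss_res / ss_tot


# Hypothetical toy values for illustration only (not data from the paper).
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.1, 1.9, 3.2, 3.8]
score = r_squared(observed, predicted)
print(f"R^2 = {score:.2f}")
```

Computing the score separately on training and held-out test data, as the summary does, is what exposes a gap such as 0.94 versus 0.74; a large gap of that kind usually indicates some remaining overfitting.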