AdaptFerm: Bioprocess Monitoring Using FTIR spectroscopy: Insights into Substrate Effects and Domain Adaptation

1. Introduction The AdaptFerm dataset is designed to support the development of a monitoring framework for lactic acid production fermentation using Fourier Transform Infrared (FTIR) spectroscopy. Its primary goal is to facilitate the control strategies for continuous fermentation processes to max...


Bibliographic details
Main authors:	Arefi, Arman; Babor, Majharulislam; Liu, Shanghua; Höhne, Marina M.-C.; Sturm, Barbara; Gómez, Pablo López; Venus, Joachim; Olszewska-Widdrat, Agata
Format:	Dataset
Language:	eng
Subjects:	Agricultural biotechnology; Bio-Waste; Bioprocessing technologies; Data Analysis; Deep learning; Domain Adaptation; Environmental biotechnology; Fermentation; FOS: Agricultural biotechnology; FOS: Environmental biotechnology; FOS: Industrial biotechnology; FTIR; Glucose; Industrial biotechnology; Lactic Acid; Machine Learning; Out-of-Distribution; Polylactic acid; Process Monitoring; Regression; Spectra; Spectroscopy; Transfer learning; Unsupervised learning; Waste treatment processes

creator	Arefi, Arman; Babor, Majharulislam; Liu, Shanghua; Höhne, Marina M.-C.; Sturm, Barbara; Gómez, Pablo López; Venus, Joachim; Olszewska-Widdrat, Agata
description	1. Introduction The AdaptFerm dataset is designed to support the development of a monitoring framework for lactic acid fermentation using Fourier Transform Infrared (FTIR) spectroscopy. Its primary goal is to facilitate control strategies for continuous fermentation processes that maximize lactic acid production. AdaptFerm encompasses data from two distinct batch fermentation environments: one employing a simple sugar (glucose) as the substrate and the other utilizing complex sugars derived from bio-waste. The study focuses on developing accurate predictive models for glucose and lactic acid concentrations, with an emphasis on applying classical machine learning techniques and enhancing domain generalization capabilities. 2. Prediction Model for Different Substrate Environments The chemical composition of the substrates is presented in Table 1 [1]. The dataset is used to train and test models within the same substrate domain: data from a single fermentation environment (e.g., glucose substrate) is used for both the training and testing phases. The applied machine learning models predicted accurately within the same domain [1]. For more details on the methods applied, see https://doi.org/10.1016/j.heliyon.2024.e38791. In that study, the MIR (mid-infrared) results correspond to the AdaptFerm dataset. The spectra of the glucose and biowaste hydrolysate fermentation processes are presented in Figure 3 and Figure 4. 3. Domain Adaptation The dataset was also used to address the challenge posed by shifts in FTIR data when substrates change. Transitioning from simple sugar (glucose) to complex sugar (bio-waste) causes significant variations in the FTIR spectra, making it difficult for models trained on glucose fermentation data to maintain prediction accuracy in the complex sugar fermentation environment. This reduces robustness and performance on out-of-distribution data.
To address these challenges, we explore methods that improve the generalization ability and robustness of models in such scenarios without using labels from complex sugar fermentation [2]. That work applies machine learning interpretation to find domain-invariant features for glucose and lactic acid. For more details on the methods applied, see https://dx.doi.org/10.2139/ssrn.5012080. The code is available at https://github.com/shl-shawn/ShapFS. 4. Real-World Use Cases 4.1. Regr
doi_str_mv	10.5281/zenodo.14171427
format	Dataset
fullrecord	<record><control><sourceid>datacite_PQ8</sourceid><recordid>TN_cdi_datacite_primary_10_5281_zenodo_14171427</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><sourcerecordid>10_5281_zenodo_14171427</sourcerecordid><originalsourceid>FETCH-datacite_primary_10_5281_zenodo_141714273</originalsourceid><addsrcrecordid>eNqVjjtuAkEQRCchsIDYaV-AhcEg0Gb8VhCQ2BCPmt1ZaImdHk03AZyen30AJ1VJPdUz5tMOsvFwavs3H7jizI7sxI6Gkw_DswqjFj41OcyJY-LSi8CWAyknCkfYyzOL3eYbJPpSE0vJ8ZrDJggdTypAQRl-LgfRhOphVdePmQCGCpbcIAV4naASh45p1XgW3_3ttukXq91i3atQsST1LiZqMF2dHbinsnsruz_lr_8Td5yQUsk</addsrcrecordid><sourcetype>Publisher</sourcetype><iscdi>true</iscdi><recordtype>dataset</recordtype></control><display><type>dataset</type><title>AdaptFerm: Bioprocess Monitoring Using FTIR spectroscopy: Insights into Substrate Effects and Domain Adaptation</title><source>DataCite</source><creator>Arefi, Arman ; Babor, Majharulislam ; Liu, Shanghua ; Höhne, Marina M.-C. ; Sturm, Barbara ; Gómez, Pablo López ; Venus, Joachim ; Olszewska-Widdrat, Agata</creator><creatorcontrib>Arefi, Arman ; Babor, Majharulislam ; Liu, Shanghua ; Höhne, Marina M.-C. ; Sturm, Barbara ; Gómez, Pablo López ; Venus, Joachim ; Olszewska-Widdrat, Agata</creatorcontrib><description>1. Introduction
The AdaptFerm dataset is designed to support the development of a monitoring framework for lactic acid fermentation using Fourier Transform Infrared (FTIR) spectroscopy. Its primary goal is to facilitate control strategies for continuous fermentation processes that maximize lactic acid production. AdaptFerm encompasses data from two distinct batch fermentation environments: one employing a simple sugar (glucose) as the substrate and the other utilizing complex sugars derived from bio-waste. The study focuses on developing accurate predictive models for glucose and lactic acid concentrations, with an emphasis on applying classical machine learning techniques and enhancing domain generalization capabilities.

2. Prediction Model for Different Substrate Environments
The chemical composition of the substrates is presented in Table 1 [1]. The dataset is used to train and test models within the same substrate domain: data from a single fermentation environment (e.g., glucose substrate) is used for both the training and testing phases. The applied machine learning models predicted accurately within the same domain [1]. For more details on the methods applied, see https://doi.org/10.1016/j.heliyon.2024.e38791. In that study, the MIR (mid-infrared) results correspond to the AdaptFerm dataset. The spectra of the glucose and biowaste hydrolysate fermentation processes are presented in Figure 3 and Figure 4.

3. Domain Adaptation
The dataset was also used to address the challenge posed by shifts in FTIR data when substrates change. Transitioning from simple sugar (glucose) to complex sugar (bio-waste) causes significant variations in the FTIR spectra, making it difficult for models trained on glucose fermentation data to maintain prediction accuracy in the complex sugar fermentation environment. This reduces robustness and performance on out-of-distribution data. To address these challenges, we explore methods that improve the generalization ability and robustness of models in such scenarios without using labels from complex sugar fermentation [2]. That work applies machine learning interpretation to find domain-invariant features for glucose and lactic acid. For more details on the methods applied, see https://dx.doi.org/10.2139/ssrn.5012080. The code is available at https://github.com/shl-shawn/ShapFS.

4. Real-World Use Cases

4.1. Regression Task
AdaptFerm serves as a benchmark for machine learning models in fermentation processes, specifically for predicting glucose and lactic acid concentrations, measured in g/L (grams per liter), while considering issues of out-of-distribution generalization.

4.2. Domain Adaptation Regression Task
The dataset is also suitable for evaluating different domain adaptation methods. In particular, the glucose substrate fermentation data can be used as the source domain, while the complex sugar fermentation data from bio-waste serves as the target domain. For semi-supervised domain adaptation approaches, it is recommended to use the initial data points (i.e., those collected at the beginning of the fermentation process) from the target domain, as the dataset is organized chronologically by collection day. These approaches aim to improve the robustness of models by transferring knowledge across domains and mitigating the effects of out-of-distribution data.

4.3. Anomaly Detection
The dataset can be used to train anomaly detection models that identify outliers or deviations from normal fermentation behavior. This could be valuable in industrial bioprocessing, where early detection of issues such as contamination or process failure is crucial. Techniques such as Isolation Forests, One-Class SVM, or autoencoders could be applied to identify unusual patterns in FTIR spectra.

4.4. Classification Task
Although the main task is regression, the dataset could also be used in classification tasks by discretizing the concentrations of glucose and lactic acid into categories (e.g., low, medium, high). This would allow classification algorithms such as Support Vector Machines (SVM), Random Forests, or neural networks to be applied for predicting the fermentation phase or identifying specific operational conditions.

4.5. Transfer Learning
Given the domain adaptation setting of this dataset, transfer learning models can be explored. Models pre-trained on glucose fermentation data can be fine-tuned on complex sugar fermentation data, enabling quicker model convergence and improved performance in data-scarce environments.

4.6. Multi-Task Learning
In a multi-task learning scenario, models could simultaneously predict both glucose and lactic acid concentrations from the same FTIR data. This could improve model accuracy by leveraging shared representations across the two tasks.

4.7. Feature Selection
The FTIR spectral data contains a large number of features (wavenumbers). Feature selection techniques such as Recursive Feature Elimination (RFE), Lasso regression, or mutual information could be applied to identify the most relevant wavenumbers for predicting glucose and lactic acid concentrations, improving model performance and interpretability.

5. Dataset Structure and Meta Information
The dataset is organized into four Excel files, corresponding to two main fermentation domains (different substrates) and two key process variables:
a) Simple Sugar Substrate: This domain contains data for the fermentation process using glucose as the substrate to produce lactic acid. It includes two files, one for glucose concentrations and one for lactic acid concentrations. Both are measured in g/L.
b) Complex Sugar Substrate: This domain contains data for the fermentation process using bio-waste as the substrate to produce lactic acid. Like the previous domain, it includes two files, one for glucose concentrations and one for lactic acid concentrations. Both are measured in g/L.
Each file is structured as follows: The first column contains the sample ID, which serves as the timeline of measurements (Sample ID 1 represents the first measurement in the fermentation process). From the second column onwards, the FTIR data is provided, covering the spectral range from 549.6 cm⁻¹ to 3999.6 cm⁻¹ and comprising 3,579 features. The final column contains the ground truth data, i.e., the chemical measurement of the file's fermentation variable (glucose or lactic acid concentration) in g/L.

6. Conclusion
AdaptFerm features FTIR spectra from two distinct fermentation environments: simple sugar (glucose) and complex sugar (bio-waste). The dataset is designed for regression tasks, including domain adaptation, and can be applied in machine learning model development for fermentation process monitoring, with a focus on enhancing model robustness and handling out-of-distribution data. It provides a valuable resource for exploring domain shift and improving the robustness of machine learning models in bioengineering and fermentation processes. It enables further research into domain generalization techniques and offers a wide range of possibilities for machine learning applications.

References
[1] Arman Arefi, Barbara Sturm, Majharulislam Babor, Michael Horf, Thomas Hoffmann, Marina Höhne, Kathleen Friedrich, Linda Schroedter, Joachim Venus, Agata Olszewska-Widdrat, Digital model of biochemical reactions in lactic acid bacterial fermentation of simple glucose and biowaste substrates, Heliyon, Volume 10, Issue 19, 2024, e38791, ISSN 2405-8440. Available at https://doi.org/10.1016/j.heliyon.2024.e38791.
[2] Majharulislam Babor, Shanghua Liu, Arman Arefi, Agata Olszewska-Widdrat, Barbara Sturm, Joachim Venus, and Marina M.-C. Höhne, Domain-Invariant Monitoring for Lactic Acid Production: Transfer Learning from Glucose to Bio-Waste Using Machine Learning Interpretation.
Available at http://dx.doi.org/10.2139/ssrn.5012080.</description><identifier>DOI: 10.5281/zenodo.14171427</identifier><language>eng</language><publisher>Zenodo</publisher><subject>Agricultural biotechnology ; Bio-Waste ; Bioprocessing technologies ; Data Analysis ; Deep learning ; Domain Adaptation ; Environmental biotechnology ; Fermentation ; FOS: Agricultural biotechnology ; FOS: Environmental biotechnology ; FOS: Industrial biotechnology ; FTIR ; Glucose ; Industrial biotechnology ; Lactic Acid ; Machine Learning ; Out-of-Distribution ; Polylactic acid ; Process Monitoring ; Regression ; Spectra ; Spectroscopy ; Transfer learning ; Unsupervised learning ; Waste treatment processes</subject><creationdate>2024</creationdate><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed><orcidid>0000-0001-7708-1783 ; 0000-0002-5440-7573 ; 0009-0009-9855-9040 ; 0000-0002-7843-6846 ; 0000-0002-3247-3019</orcidid></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><link.rule.ids>776,1888</link.rule.ids><linktorsrc>$$Uhttps://commons.datacite.org/doi.org/10.5281/zenodo.14171427$$EView_record_in_DataCite.org$$FView_record_in_$$GDataCite.org$$Hfree_for_read</linktorsrc></links><search><creatorcontrib>Arefi, Arman</creatorcontrib><creatorcontrib>Babor, Majharulislam</creatorcontrib><creatorcontrib>Liu, Shanghua</creatorcontrib><creatorcontrib>Höhne, Marina M.-C.</creatorcontrib><creatorcontrib>Sturm, Barbara</creatorcontrib><creatorcontrib>Gómez, Pablo López</creatorcontrib><creatorcontrib>Venus, Joachim</creatorcontrib><creatorcontrib>Olszewska-Widdrat, Agata</creatorcontrib><title>AdaptFerm: Bioprocess Monitoring Using FTIR spectroscopy: Insights into Substrate Effects and Domain Adaptation</title>
<subject>Agricultural biotechnology</subject><subject>Bio-Waste</subject><subject>Bioprocessing technologies</subject><subject>Data Analysis</subject><subject>Deep learning</subject><subject>Domain Adaptation</subject><subject>Environmental biotechnology</subject><subject>Fermentation</subject><subject>FOS: Agricultural biotechnology</subject><subject>FOS: Environmental biotechnology</subject><subject>FOS: Industrial biotechnology</subject><subject>FTIR</subject><subject>Glucose</subject><subject>Industrial biotechnology</subject><subject>Lactic Acid</subject><subject>Machine Learning</subject><subject>Out-of-Distribution</subject><subject>Polylactic acid</subject><subject>Process Monitoring</subject><subject>Regression</subject><subject>Spectra</subject><subject>Spectroscopy</subject><subject>Transfer learning</subject><subject>Unsupervised learning</subject><subject>Waste treatment processes</subject><fulltext>true</fulltext><rsrctype>dataset</rsrctype><creationdate>2024</creationdate><recordtype>dataset</recordtype><sourceid>PQ8</sourceid><recordid>eNqVjjtuAkEQRCchsIDYaV-AhcEg0Gb8VhCQ2BCPmt1ZaImdHk03AZyen30AJ1VJPdUz5tMOsvFwavs3H7jizI7sxI6Gkw_DswqjFj41OcyJY-LSi8CWAyknCkfYyzOL3eYbJPpSE0vJ8ZrDJggdTypAQRl-LgfRhOphVdePmQCGCpbcIAV4naASh45p1XgW3_3ttukXq91i3atQsST1LiZqMF2dHbinsnsruz_lr_8Td5yQUsk</recordid><startdate>20241115</startdate><enddate>20241115</enddate><creator>Arefi, Arman</creator><creator>Babor, Majharulislam</creator><creator>Liu, Shanghua</creator><creator>Höhne, Marina M.-C.</creator><creator>Sturm, Barbara</creator><creator>Gómez, Pablo López</creator><creator>Venus, Joachim</creator><creator>Olszewska-Widdrat, 
Agata</creator><general>Zenodo</general><scope>DYCCY</scope><scope>PQ8</scope><orcidid>https://orcid.org/0000-0001-7708-1783</orcidid><orcidid>https://orcid.org/0000-0002-5440-7573</orcidid><orcidid>https://orcid.org/0009-0009-9855-9040</orcidid><orcidid>https://orcid.org/0000-0002-7843-6846</orcidid><orcidid>https://orcid.org/0000-0002-3247-3019</orcidid></search><sort><creationdate>20241115</creationdate><title>AdaptFerm: Bioprocess Monitoring Using FTIR spectroscopy: Insights into Substrate Effects and Domain Adaptation</title><author>Arefi, Arman ; Babor, Majharulislam ; Liu, Shanghua ; Höhne, Marina M.-C. ; Sturm, Barbara ; Gómez, Pablo López ; Venus, Joachim ; Olszewska-Widdrat, Agata</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-datacite_primary_10_5281_zenodo_141714273</frbrgroupid><rsrctype>datasets</rsrctype><prefilter>datasets</prefilter><language>eng</language><creationdate>2024</creationdate><topic>Agricultural biotechnology</topic><topic>Bio-Waste</topic><topic>Bioprocessing technologies</topic><topic>Data Analysis</topic><topic>Deep learning</topic><topic>Domain Adaptation</topic><topic>Environmental biotechnology</topic><topic>Fermentation</topic><topic>FOS: Agricultural biotechnology</topic><topic>FOS: Environmental biotechnology</topic><topic>FOS: Industrial biotechnology</topic><topic>FTIR</topic><topic>Glucose</topic><topic>Industrial biotechnology</topic><topic>Lactic Acid</topic><topic>Machine Learning</topic><topic>Out-of-Distribution</topic><topic>Polylactic acid</topic><topic>Process Monitoring</topic><topic>Regression</topic><topic>Spectra</topic><topic>Spectroscopy</topic><topic>Transfer learning</topic><topic>Unsupervised learning</topic><topic>Waste treatment processes</topic><toplevel>online_resources</toplevel><creatorcontrib>Arefi, Arman</creatorcontrib><creatorcontrib>Babor, Majharulislam</creatorcontrib><creatorcontrib>Liu, Shanghua</creatorcontrib><creatorcontrib>Höhne, Marina 
M.-C.</creatorcontrib><creatorcontrib>Sturm, Barbara</creatorcontrib><creatorcontrib>Gómez, Pablo López</creatorcontrib><creatorcontrib>Venus, Joachim</creatorcontrib><creatorcontrib>Olszewska-Widdrat, Agata</creatorcontrib><collection>DataCite (Open Access)</collection><collection>DataCite</collection></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext_linktorsrc</fulltext></delivery><addata><au>Arefi, Arman</au><au>Babor, Majharulislam</au><au>Liu, Shanghua</au><au>Höhne, Marina M.-C.</au><au>Sturm, Barbara</au><au>Gómez, Pablo López</au><au>Venus, Joachim</au><au>Olszewska-Widdrat, Agata</au><format>book</format><genre>unknown</genre><ristype>DATA</ristype><title>AdaptFerm: Bioprocess Monitoring Using FTIR spectroscopy: Insights into Substrate Effects and Domain Adaptation</title><date>2024-11-15</date><risdate>2024</risdate><abstract> 1. Introduction The AdaptFerm dataset is designed to support the development of a monitoring framework for lactic acid production fermentation using Fourier Transform Infrared (FTIR) spectroscopy. Its primary goal is to facilitate the control strategies for continuous fermentation processes to maximize the lactic acid production. The AdaptFerm encompasses data from two distinct batch fermentation environments: one employing simple sugar (glucose) as the substrate and the other utilizing complex sugars derived from bio-waste. The study focuses on developing accurate predictive models for glucose and lactic acid concentrations, with an emphasis on applying classical machine learning techniques and enhancing domain generalization capabilities. 2. Prediction Model for Different Substrate Environments The chemical composition of substrates are presented in Table 1 [1]. The dataset is utilized to train and test models within the same substrate domain. For instance, data from a single fermentation environment (e.g., glucose substrate) is used for both training and testing phases. 
The applied machine learning models showed accurate prediction within the same domain [1]. For more details on the methods applied, please refer to the following link: https://doi.org/10.1016/j.heliyon.2024.e38791. In this study, the MIR results correspond to the AdaptFerm dataset. The spectra of the glucose and biowaste hydrolysate fermentation process are presented in Figure 3 and Figure 4. 3. Domain Adaptation The dataset was also used to address the challenge posed by shifts in FTIR data when substrates change. Transitioning from simple sugar (glucose) to complex sugar (bio-waste) causes significant variations in the FTIR spectra, making it difficult for models trained on glucose fermentation data to maintain prediction accuracy in the complex sugar fermentation environment. This results in reduced robustness and performance when applied to out-of-distribution data. To address these challenges, we explore methods that improve the generalization ability and robustness of models in such scenarios without using labels from complex sugar fermentation [2]. It shows the application of machine learning interpretation to find domain invariant features for glucose and lactic acid. For more details on the methods applied, please refer to the following link: https://dx.doi.org/10.2139/ssrn.5012080. The code is available at https://github.com/shl-shawn/ShapFS. 4. Real-World Use Cases 4.1. Regression Task AdaptFerm serves as a benchmark for machine learning model applications in fermentation processes, specifically for predicting glucose and lactic acid concentrations, measured in g/L (grams per liter), while considering issues of out-of-distribution generalization. 4.2. Domain Adaptation Regression Task The dataset is also suitable for evaluating different domain adaptation methods. In particular, the glucose substrate fermentation data can be used as the source domain, while the complex sugar fermentation data from bio-waste serves as the target domain. 
For semi-supervised domain adaptation approaches, it is recommended to use the initial data points from the target domain (i.e., those collected at the beginning of the fermentation process), since the dataset is organized chronologically by collection day. These approaches aim to make models more robust by transferring knowledge across domains and mitigating the effects of out-of-distribution data.

4.3. Anomaly Detection

The dataset can be used to train anomaly detection models that identify outliers or deviations from normal fermentation behavior. This could be valuable in industrial bioprocessing, where early detection of issues such as contamination or process failure is crucial. Techniques such as Isolation Forests, One-Class SVM, or autoencoders could be applied to identify unusual patterns in FTIR spectra.

4.4. Classification Task

Although the main task is regression, the dataset could also be used for classification by discretizing the glucose and lactic acid concentrations into categories (e.g., low, medium, high). This would allow classification algorithms such as Support Vector Machines (SVM), Random Forests, or neural networks to predict the fermentation phase or identify specific operational conditions.

4.5. Transfer Learning

Given the domain adaptation setting of this dataset, transfer learning can also be explored. Models pre-trained on glucose fermentation data can be fine-tuned on complex sugar fermentation data, which can enable quicker model convergence and improved performance in data-scarce environments.

4.6. Multi-Task Learning

In a multi-task learning scenario, a model could predict glucose and lactic acid concentrations simultaneously from the same FTIR data. This could improve accuracy by leveraging representations shared across the two tasks.

4.7. Feature Selection

The FTIR spectral data contains a large number of features (wavenumbers). Feature selection techniques such as Recursive Feature Elimination (RFE), Lasso regression, or mutual information could be applied to identify the wavenumbers most relevant for predicting glucose and lactic acid concentrations, improving model performance and interpretability.

5. Dataset Structure and Meta Information

The dataset is organized into four Excel files, corresponding to two fermentation domains (different substrates) and two key process variables:

a) Simple Sugar Substrate: This domain contains data for the fermentation process using glucose as the substrate to produce lactic acid. It includes two files, one for glucose concentrations and one for lactic acid concentrations, both measured in g/L.

b) Complex Sugar Substrate: This domain contains data for the fermentation process using bio-waste as the substrate to produce lactic acid. As in the previous domain, it includes two files, one for glucose concentrations and one for lactic acid concentrations, both measured in g/L.

Each file is structured as follows:

The first column contains the sample ID, which serves as the timeline of measurements (sample ID 1 is the first measurement in the fermentation process). From the second column onwards, the FTIR data is provided, covering the spectral range from 549.6 cm⁻¹ to 3999.6 cm⁻¹ and comprising 3,579 features. The final column contains the ground truth: the chemical measurement of the fermentation variable (glucose or lactic acid concentration), in g/L.

6. Conclusion

The AdaptFerm dataset features FTIR spectra from two distinct fermentation environments: simple sugar (glucose) and complex sugar (bio-waste).
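The file layout described in Section 5 (sample ID first, spectral columns in between, reference value last) can be sketched as follows. This is a minimal illustration under assumptions: the rows are tiny synthetic stand-ins, any header row is assumed to have been removed already, the helper names `split_rows` and `wavenumber_axis` are hypothetical, and evenly spaced wavenumbers are assumed rather than stated by the dataset description.

```python
def split_rows(rows):
    """Split table rows into sample IDs, spectra, and reference values.

    Assumed layout: first column = sample ID, last column = concentration
    in g/L, every column in between = one FTIR feature.
    """
    sample_ids = [row[0] for row in rows]
    spectra = [list(row[1:-1]) for row in rows]
    targets = [row[-1] for row in rows]
    return sample_ids, spectra, targets


def wavenumber_axis(start=549.6, stop=3999.6, n_features=3579):
    """Evenly spaced wavenumbers in cm^-1 (uniform spacing is an assumption)."""
    step = (stop - start) / (n_features - 1)
    return [start + i * step for i in range(n_features)]


# Synthetic rows standing in for one file: 3 spectral columns instead of 3,579.
# With pandas installed, real rows could come from something like
# pandas.read_excel(path).values.tolist(), where `path` is a placeholder
# for one of the four Excel files.
rows = [
    [1, 0.11, 0.12, 0.13, 50.0],
    [2, 0.21, 0.22, 0.23, 42.5],
]

sample_ids, spectra, targets = split_rows(rows)
axis = wavenumber_axis()
print(sample_ids, targets, len(axis))
```

Because the sample ID doubles as the measurement order, sorting or slicing by it gives a chronological view of each fermentation run.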
The dataset is designed for regression tasks, including domain adaptation, and can be applied in machine learning model development for fermentation process monitoring, with a focus on enhancing model robustness and handling out-of-distribution data. It provides a resource for exploring domain shift and improving the robustness of machine learning models in bioengineering and fermentation processes, enables further research into domain generalization techniques, and supports a wide range of machine learning applications.

References

[1] Arman Arefi, Barbara Sturm, Majharulislam Babor, Michael Horf, Thomas Hoffmann, Marina Höhne, Kathleen Friedrich, Linda Schroedter, Joachim Venus, Agata Olszewska-Widdrat. Digital model of biochemical reactions in lactic acid bacterial fermentation of simple glucose and biowaste substrates. Heliyon, Volume 10, Issue 19, 2024, e38791, ISSN 2405-8440. https://doi.org/10.1016/j.heliyon.2024.e38791.

[2] Majharulislam Babor, Shanghua Liu, Arman Arefi, Agata Olszewska-Widdrat, Barbara Sturm, Joachim Venus, Marina M.-C. Höhne. Domain-Invariant Monitoring for Lactic Acid Production: Transfer Learning from Glucose to Bio-Waste Using Machine Learning Interpretation. Available at http://dx.doi.org/10.2139/ssrn.5012080.</abstract><pub>Zenodo</pub><doi>10.5281/zenodo.14171427</doi><orcidid>https://orcid.org/0000-0001-7708-1783</orcidid><orcidid>https://orcid.org/0000-0002-5440-7573</orcidid><orcidid>https://orcid.org/0009-0009-9855-9040</orcidid><orcidid>https://orcid.org/0000-0002-7843-6846</orcidid><orcidid>https://orcid.org/0000-0002-3247-3019</orcidid><oa>free_for_read</oa></addata></record>
fulltext	fulltext_linktorsrc
identifier	DOI: 10.5281/zenodo.14171427
ispartof
issn
language	eng
recordid	cdi_datacite_primary_10_5281_zenodo_14171427
source	DataCite
subjects	Agricultural biotechnology Bio-Waste Bioprocessing technologies Data Analysis Deep learning Domain Adaptation Environmental biotechnology Fermentation FOS: Agricultural biotechnology FOS: Environmental biotechnology FOS: Industrial biotechnology FTIR Glucose Industrial biotechnology Lactic Acid Machine Learning Out-of-Distribution Polylactic acid Process Monitoring Regression Spectra Spectroscopy Transfer learning Unsupervised learning Waste treatment processes
title	AdaptFerm: Bioprocess Monitoring Using FTIR spectroscopy: Insights into Substrate Effects and Domain Adaptation
url	https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-02-07T22%3A57%3A08IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-datacite_PQ8&rft_val_fmt=info:ofi/fmt:kev:mtx:book&rft.genre=unknown&rft.au=Arefi,%20Arman&rft.date=2024-11-15&rft_id=info:doi/10.5281/zenodo.14171427&rft_dat=%3Cdatacite_PQ8%3E10_5281_zenodo_14171427%3C/datacite_PQ8%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_id=info:pmid/&rfr_iscdi=true