AdaptFerm: Bioprocess Monitoring Using FTIR spectroscopy: Insights into Substrate Effects and Domain Adaptation

  1. Introduction The AdaptFerm dataset is designed to support the development of a monitoring framework for lactic acid production fermentation using Fourier Transform Infrared (FTIR) spectroscopy. Its primary goal is to facilitate the control strategies for continuous fermentation processes to max...

Bibliographic details
Main authors: Arefi, Arman; Babor, Majharulislam; Liu, Shanghua; Höhne, Marina M.-C.; Sturm, Barbara; Gómez, Pablo López; Venus, Joachim; Olszewska-Widdrat, Agata
Format: Dataset
Language: eng
creator Arefi, Arman
Babor, Majharulislam
Liu, Shanghua
Höhne, Marina M.-C.
Sturm, Barbara
Gómez, Pablo López
Venus, Joachim
Olszewska-Widdrat, Agata
description 1. Introduction The AdaptFerm dataset is designed to support the development of a monitoring framework for lactic acid production fermentation using Fourier Transform Infrared (FTIR) spectroscopy. Its primary goal is to facilitate control strategies for continuous fermentation processes to maximize lactic acid production. AdaptFerm encompasses data from two distinct batch fermentation environments: one employing a simple sugar (glucose) as the substrate and the other utilizing complex sugars derived from bio-waste. The study focuses on developing accurate predictive models for glucose and lactic acid concentrations, with an emphasis on applying classical machine learning techniques and enhancing domain generalization capabilities. 2. Prediction Model for Different Substrate Environments The chemical composition of the substrates is presented in Table 1 [1]. The dataset is used to train and test models within the same substrate domain. For instance, data from a single fermentation environment (e.g., glucose substrate) is used for both the training and testing phases. The applied machine learning models showed accurate prediction within the same domain [1]. For more details on the methods applied, please refer to the following link: https://doi.org/10.1016/j.heliyon.2024.e38791. In that study, the MIR (mid-infrared) results correspond to the AdaptFerm dataset. The spectra of the glucose and biowaste hydrolysate fermentation processes are presented in Figure 3 and Figure 4. 3. Domain Adaptation The dataset was also used to address the challenge posed by shifts in FTIR data when substrates change. Transitioning from simple sugar (glucose) to complex sugar (bio-waste) causes significant variations in the FTIR spectra, making it difficult for models trained on glucose fermentation data to maintain prediction accuracy in the complex sugar fermentation environment. This results in reduced robustness and performance when the models are applied to out-of-distribution data.
To address these challenges, we explore methods that improve the generalization ability and robustness of models in such scenarios without using labels from complex sugar fermentation [2]. That work applies machine learning interpretation to find domain-invariant features for glucose and lactic acid. For more details on the methods applied, please refer to the following link: https://dx.doi.org/10.2139/ssrn.5012080. The code is available at https://github.com/shl-shawn/ShapFS. 4. Real-World Use Cases 4.1. Regr
doi_str_mv 10.5281/zenodo.14171427
format Dataset
fullrecord <record><control><sourceid>datacite_PQ8</sourceid><recordid>TN_cdi_datacite_primary_10_5281_zenodo_14171427</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><sourcerecordid>10_5281_zenodo_14171427</sourcerecordid><originalsourceid>FETCH-datacite_primary_10_5281_zenodo_141714273</originalsourceid><addsrcrecordid>eNqVjjtuAkEQRCchsIDYaV-AhcEg0Gb8VhCQ2BCPmt1ZaImdHk03AZyen30AJ1VJPdUz5tMOsvFwavs3H7jizI7sxI6Gkw_DswqjFj41OcyJY-LSi8CWAyknCkfYyzOL3eYbJPpSE0vJ8ZrDJggdTypAQRl-LgfRhOphVdePmQCGCpbcIAV4naASh45p1XgW3_3ttukXq91i3atQsST1LiZqMF2dHbinsnsruz_lr_8Td5yQUsk</addsrcrecordid><sourcetype>Publisher</sourcetype><iscdi>true</iscdi><recordtype>dataset</recordtype></control><display><type>dataset</type><title>AdaptFerm: Bioprocess Monitoring Using FTIR spectroscopy: Insights into Substrate Effects and Domain Adaptation</title><source>DataCite</source><creator>Arefi, Arman ; Babor, Majharulislam ; Liu, Shanghua ; Höhne, Marina M.-C. ; Sturm, Barbara ; Gómez, Pablo López ; Venus, Joachim ; Olszewska-Widdrat, Agata</creator><creatorcontrib>Arefi, Arman ; Babor, Majharulislam ; Liu, Shanghua ; Höhne, Marina M.-C. ; Sturm, Barbara ; Gómez, Pablo López ; Venus, Joachim ; Olszewska-Widdrat, Agata</creatorcontrib><description>  1. Introduction The AdaptFerm dataset is designed to support the development of a monitoring framework for lactic acid production fermentation using Fourier Transform Infrared (FTIR) spectroscopy. Its primary goal is to facilitate the control strategies for continuous fermentation processes to maximize the lactic acid production. The AdaptFerm encompasses data from two distinct batch fermentation environments: one employing simple sugar (glucose) as the substrate and the other utilizing complex sugars derived from bio-waste. 
The study focuses on developing accurate predictive models for glucose and lactic acid concentrations, with an emphasis on applying classical machine learning techniques and enhancing domain generalization capabilities. 2. Prediction Model for Different Substrate Environments The chemical composition of the substrates is presented in Table 1 [1]. The dataset is utilized to train and test models within the same substrate domain. For instance, data from a single fermentation environment (e.g., glucose substrate) is used for both training and testing phases. The applied machine learning models showed accurate prediction within the same domain [1]. For more details on the methods applied, please refer to the following link: https://doi.org/10.1016/j.heliyon.2024.e38791. In this study, the MIR results correspond to the AdaptFerm dataset. The spectra of the glucose and biowaste hydrolysate fermentation process are presented in Figure 3 and Figure 4. 3. Domain Adaptation The dataset was also used to address the challenge posed by shifts in FTIR data when substrates change. Transitioning from simple sugar (glucose) to complex sugar (bio-waste) causes significant variations in the FTIR spectra, making it difficult for models trained on glucose fermentation data to maintain prediction accuracy in the complex sugar fermentation environment. This results in reduced robustness and performance when applied to out-of-distribution data. To address these challenges, we explore methods that improve the generalization ability and robustness of models in such scenarios without using labels from complex sugar fermentation [2]. It shows the application of machine learning interpretation to find domain invariant features for glucose and lactic acid. For more details on the methods applied, please refer to the following link: https://dx.doi.org/10.2139/ssrn.5012080. The code is available at https://github.com/shl-shawn/ShapFS. 4. Real-World Use Cases 4.1. 
Regression Task AdaptFerm serves as a benchmark for machine learning model applications in fermentation processes, specifically for predicting glucose and lactic acid concentrations, measured in g/L (grams per liter), while considering issues of out-of-distribution generalization. 4.2. Domain Adaptation Regression Task The dataset is also suitable for evaluating different  domain adaptation methods. In particular, the glucose substrate fermentation data can be used as the source domain, while the complex sugar fermentation data from bio-waste serves as the target domain. For semi-supervised domain adaptation approaches, it is recommended to use the initial data points (i.e., those collected at the beginning of the fermentation process) from the target domain, as the dataset is organized chronologically by collection day. These approaches aim to improve the robustness of models by transferring knowledge across domains and mitigating the effects of out-of-distribution data. 4.3. Anomaly Detection The dataset can be used to train anomaly detection models to identify outliers or deviations from normal fermentation behavior. This could be valuable in industrial bioprocessing, where early detection of issues like contamination or process failure is crucial. Techniques like Isolation Forests, One-Class SVM, or Autoencoders could be applied to identify unusual patterns in FTIR spectra. 4.4. Classification Task Although the main task is regression, the dataset could also be used in classification tasks by discretizing the concentrations of glucose and lactic acid into categories (e.g., low, medium, high). This would allow for the application of classification algorithms like Support Vector Machines (SVM), Random Forests, or Neural Networks for predicting the fermentation phase or identifying specific operational conditions. 4.5. Transfer Learning Given the nature of the domain adaptation approach in this dataset, transfer learning models can be explored. 
Models pre-trained on glucose fermentation data can be fine-tuned on complex sugar fermentation data, enabling quicker model convergence and improved performance in data-scarce environments. 4.6. Multi-Task Learning In a multi-task learning scenario, models could simultaneously predict both glucose and lactic acid concentrations from the same FTIR data. This could help in improving model accuracy by leveraging shared representations across the two tasks. 4.7. Feature Selection The FTIR spectral data contains a large number of features (wavenumbers), and feature selection techniques such as Recursive Feature Elimination (RFE), Lasso regression, or mutual information could be applied to identify the most relevant wavenumbers for predicting glucose and lactic acid concentrations, improving model performance and interpretability. 5. Dataset Structure and Meta Information The dataset is organized into four Excel files, corresponding to two main fermentation domains (different substrates) and two key process variables: a) Simple Sugar Substrate: This domain contains data for the fermentation process using glucose as the substrate to produce lactic acid. It includes two files: one for glucose concentrations and one for lactic acid concentrations. Both are measured in g/L. b) Complex Sugar Substrate: This domain contains data for the fermentation process using bio-waste as the substrate to produce lactic acid. Similar to the previous domain, it includes two files: one for glucose concentrations and one for lactic acid concentrations. Both are measured in g/L. Each file is structured as follows: The first column contains the sample ID, which serves as the timeline of measurements (Sample ID 1 represents the first measurement in the fermentation process). From the second column onwards, the FTIR data is provided, covering the spectral range from 549.6 cm-1 to 3999.6 cm-1 comprising 3,579 features. 
The final column contains the ground truth data, the chemical measurements of fermentation variables such as glucose and lactic acid concentrations, both measured in g/L. 6. Conclusion The AdaptFerm features FTIR spectra data from two distinct fermentation environments: simple sugar (glucose) and complex sugar (bio-waste). The dataset is designed to be used in regression tasks, including domain adaptation, and can be applied in machine learning model development for fermentation process monitoring, with a focus on enhancing model robustness and handling out-of-distribution data. This dataset provides a valuable resource for exploring domain shift and improving the robustness of machine learning models in bioengineering and fermentation processes. It enables further research into domain generalization techniques and offers a wide range of possibilities for machine learning applications. References  [1] Arman Arefi, Barbara Sturm, Majharulislam Babor, Michael Horf, Thomas Hoffmann, Marina Höhne, Kathleen Friedrich, Linda Schroedter, Joachim Venus, Agata Olszewska-Widdrat, Digital model of biochemical reactions in lactic acid bacterial fermentation of simple glucose and biowaste substrates, Heliyon, Volume 10, Issue 19, 2024, e38791, ISSN 2405-8440, DOI: 10.1016/j.heliyon.2024.e38791, https://doi.org/10.1016/j.heliyon.2024.e38791. [2]  Majharulislam Babor, Shanghua Liu, Arman Arefi, Agata Olszewska-Widdrat,  Barbara Sturm, Joachim Venus, and Marina M.-C. Höhne, Domain-Invariant Monitoring for Lactic Acid Production: Transfer Learning from Glucose to Bio-Waste Using Machine Learning Interpretation. 
Available at http://dx.doi.org/10.2139/ssrn.5012080.</description><identifier>DOI: 10.5281/zenodo.14171427</identifier><language>eng</language><publisher>Zenodo</publisher><subject>Agricultural biotechnology ; Bio-Waste ; Bioprocessing technologies ; Data Analysis ; Deep learning ; Domain Adaptation ; Environmental biotechnology ; Fermentation ; FOS: Agricultural biotechnology ; FOS: Environmental biotechnology ; FOS: Industrial biotechnology ; FTIR ; Glucose ; Industrial biotechnology ; Lactic Acid ; Machine Learning ; Out-of-Distribution ; Polylactic acid ; Process Monitoring ; Regression ; Spectra ; Spectroscopy ; Transfer learning ; Unsupervised learning ; Waste treatment processes</subject><creationdate>2024</creationdate><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed><orcidid>0000-0001-7708-1783 ; 0000-0002-5440-7573 ; 0009-0009-9855-9040 ; 0000-0002-7843-6846 ; 0000-0002-3247-3019</orcidid></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><link.rule.ids>776,1888</link.rule.ids><linktorsrc>$$Uhttps://commons.datacite.org/doi.org/10.5281/zenodo.14171427$$EView_record_in_DataCite.org$$FView_record_in_$$GDataCite.org$$Hfree_for_read</linktorsrc></links><search><creatorcontrib>Arefi, Arman</creatorcontrib><creatorcontrib>Babor, Majharulislam</creatorcontrib><creatorcontrib>Liu, Shanghua</creatorcontrib><creatorcontrib>Höhne, Marina M.-C.</creatorcontrib><creatorcontrib>Sturm, Barbara</creatorcontrib><creatorcontrib>Gómez, Pablo López</creatorcontrib><creatorcontrib>Venus, Joachim</creatorcontrib><creatorcontrib>Olszewska-Widdrat, Agata</creatorcontrib><title>AdaptFerm: Bioprocess Monitoring Using FTIR spectroscopy: Insights into Substrate Effects and Domain Adaptation</title><description>  1. 
Introduction The AdaptFerm dataset is designed to support the development of a monitoring framework for lactic acid production fermentation using Fourier Transform Infrared (FTIR) spectroscopy. Its primary goal is to facilitate the control strategies for continuous fermentation processes to maximize the lactic acid production. The AdaptFerm encompasses data from two distinct batch fermentation environments: one employing simple sugar (glucose) as the substrate and the other utilizing complex sugars derived from bio-waste. The study focuses on developing accurate predictive models for glucose and lactic acid concentrations, with an emphasis on applying classical machine learning techniques and enhancing domain generalization capabilities. 2. Prediction Model for Different Substrate Environments The chemical composition of the substrates is presented in Table 1 [1]. The dataset is utilized to train and test models within the same substrate domain. For instance, data from a single fermentation environment (e.g., glucose substrate) is used for both training and testing phases. The applied machine learning models showed accurate prediction within the same domain [1]. For more details on the methods applied, please refer to the following link: https://doi.org/10.1016/j.heliyon.2024.e38791. In this study, the MIR results correspond to the AdaptFerm dataset. The spectra of the glucose and biowaste hydrolysate fermentation process are presented in Figure 3 and Figure 4. 3. Domain Adaptation The dataset was also used to address the challenge posed by shifts in FTIR data when substrates change. Transitioning from simple sugar (glucose) to complex sugar (bio-waste) causes significant variations in the FTIR spectra, making it difficult for models trained on glucose fermentation data to maintain prediction accuracy in the complex sugar fermentation environment. This results in reduced robustness and performance when applied to out-of-distribution data. 
To address these challenges, we explore methods that improve the generalization ability and robustness of models in such scenarios without using labels from complex sugar fermentation [2]. It shows the application of machine learning interpretation to find domain invariant features for glucose and lactic acid. For more details on the methods applied, please refer to the following link: https://dx.doi.org/10.2139/ssrn.5012080. The code is available at https://github.com/shl-shawn/ShapFS. 4. Real-World Use Cases 4.1. Regression Task AdaptFerm serves as a benchmark for machine learning model applications in fermentation processes, specifically for predicting glucose and lactic acid concentrations, measured in g/L (grams per liter), while considering issues of out-of-distribution generalization. 4.2. Domain Adaptation Regression Task The dataset is also suitable for evaluating different  domain adaptation methods. In particular, the glucose substrate fermentation data can be used as the source domain, while the complex sugar fermentation data from bio-waste serves as the target domain. For semi-supervised domain adaptation approaches, it is recommended to use the initial data points (i.e., those collected at the beginning of the fermentation process) from the target domain, as the dataset is organized chronologically by collection day. These approaches aim to improve the robustness of models by transferring knowledge across domains and mitigating the effects of out-of-distribution data. 4.3. Anomaly Detection The dataset can be used to train anomaly detection models to identify outliers or deviations from normal fermentation behavior. This could be valuable in industrial bioprocessing, where early detection of issues like contamination or process failure is crucial. Techniques like Isolation Forests, One-Class SVM, or Autoencoders could be applied to identify unusual patterns in FTIR spectra. 4.4. 
Classification Task Although the main task is regression, the dataset could also be used in classification tasks by discretizing the concentrations of glucose and lactic acid into categories (e.g., low, medium, high). This would allow for the application of classification algorithms like Support Vector Machines (SVM), Random Forests, or Neural Networks for predicting the fermentation phase or identifying specific operational conditions. 4.5. Transfer Learning Given the nature of the domain adaptation approach in this dataset, transfer learning models can be explored. Models pre-trained on glucose fermentation data can be fine-tuned on complex sugar fermentation data, enabling quicker model convergence and improved performance in data-scarce environments. 4.6. Multi-Task Learning In a multi-task learning scenario, models could simultaneously predict both glucose and lactic acid concentrations from the same FTIR data. This could help in improving model accuracy by leveraging shared representations across the two tasks. 4.7. Feature Selection The FTIR spectral data contains a large number of features (wavenumbers), and feature selection techniques such as Recursive Feature Elimination (RFE), Lasso regression, or mutual information could be applied to identify the most relevant wavenumbers for predicting glucose and lactic acid concentrations, improving model performance and interpretability. 5. Dataset Structure and Meta Information The dataset is organized into four Excel files, corresponding to two main fermentation domains (different substrates) and two key process variables: a) Simple Sugar Substrate: This domain contains data for the fermentation process using glucose as the substrate to produce lactic acid. It includes two files: one for glucose concentrations and one for lactic acid concentrations. Both are measured in g/L. b) Complex Sugar Substrate: This domain contains data for the fermentation process using bio-waste as the substrate to produce lactic acid. 
Similar to the previous domain, it includes two files—one for glucose concentrations and one for lactic acid concentrations. Both are measured in g/L. Each file is structured as follows: The first column contains the sample ID, which serves as the timeline of measurements (Sample ID 1 represents the first measurement in the fermentation process). From the second column onwards, the FTIR data is provided, covering the spectral range from 549.6 cm-1 to 3999.6 cm-1 comprising 3,579 features. The final column contains the ground truth data, the chemical measurements of fermentation variables such as glucose and lactic acid concentrations, both measured in g/L. 6. Conclusion The AdaptFerm features FTIR spectra data from two distinct fermentation environments: simple sugar (glucose) and complex sugar (bio-waste). The dataset is designed to be used in regression tasks, including domain adaptation, and can be applied in machine learning model development for fermentation process monitoring, with a focus on enhancing model robustness and handling out-of-distribution data. This dataset provides a valuable resource for exploring domain shift and improving the robustness of machine learning models in bioengineering and fermentation processes. It enables further research into domain generalization techniques and offers a wide range of possibilities for machine learning applications. References  [1] Arman Arefi, Barbara Sturm, Majharulislam Babor, Michael Horf, Thomas Hoffmann, Marina Höhne, Kathleen Friedrich, Linda Schroedter, Joachim Venus, Agata Olszewska-Widdrat, Digital model of biochemical reactions in lactic acid bacterial fermentation of simple glucose and biowaste substrates, Heliyon, Volume 10, Issue 19, 2024, e38791, ISSN 2405-8440, DOI: 10.1016/j.heliyon.2024.e38791, https://doi.org/10.1016/j.heliyon.2024.e38791. [2]  Majharulislam Babor, Shanghua Liu, Arman Arefi, Agata Olszewska-Widdrat,  Barbara Sturm, Joachim Venus, and Marina M.-C. 
Höhne, Domain-Invariant Monitoring for Lactic Acid Production: Transfer Learning from Glucose to Bio-Waste Using Machine Learning Interpretation. Available at http://dx.doi.org/10.2139/ssrn.5012080.</description><subject>Agricultural biotechnology</subject><subject>Bio-Waste</subject><subject>Bioprocessing technologies</subject><subject>Data Analysis</subject><subject>Deep learning</subject><subject>Domain Adaptation</subject><subject>Environmental biotechnology</subject><subject>Fermentation</subject><subject>FOS: Agricultural biotechnology</subject><subject>FOS: Environmental biotechnology</subject><subject>FOS: Industrial biotechnology</subject><subject>FTIR</subject><subject>Glucose</subject><subject>Industrial biotechnology</subject><subject>Lactic Acid</subject><subject>Machine Learning</subject><subject>Out-of-Distribution</subject><subject>Polylactic acid</subject><subject>Process Monitoring</subject><subject>Regression</subject><subject>Spectra</subject><subject>Spectroscopy</subject><subject>Transfer learning</subject><subject>Unsupervised learning</subject><subject>Waste treatment processes</subject><fulltext>true</fulltext><rsrctype>dataset</rsrctype><creationdate>2024</creationdate><recordtype>dataset</recordtype><sourceid>PQ8</sourceid><recordid>eNqVjjtuAkEQRCchsIDYaV-AhcEg0Gb8VhCQ2BCPmt1ZaImdHk03AZyen30AJ1VJPdUz5tMOsvFwavs3H7jizI7sxI6Gkw_DswqjFj41OcyJY-LSi8CWAyknCkfYyzOL3eYbJPpSE0vJ8ZrDJggdTypAQRl-LgfRhOphVdePmQCGCpbcIAV4naASh45p1XgW3_3ttukXq91i3atQsST1LiZqMF2dHbinsnsruz_lr_8Td5yQUsk</recordid><startdate>20241115</startdate><enddate>20241115</enddate><creator>Arefi, Arman</creator><creator>Babor, Majharulislam</creator><creator>Liu, Shanghua</creator><creator>Höhne, Marina M.-C.</creator><creator>Sturm, Barbara</creator><creator>Gómez, Pablo López</creator><creator>Venus, Joachim</creator><creator>Olszewska-Widdrat, 
Agata</creator><general>Zenodo</general><scope>DYCCY</scope><scope>PQ8</scope><orcidid>https://orcid.org/0000-0001-7708-1783</orcidid><orcidid>https://orcid.org/0000-0002-5440-7573</orcidid><orcidid>https://orcid.org/0009-0009-9855-9040</orcidid><orcidid>https://orcid.org/0000-0002-7843-6846</orcidid><orcidid>https://orcid.org/0000-0002-3247-3019</orcidid></search><sort><creationdate>20241115</creationdate><title>AdaptFerm: Bioprocess Monitoring Using FTIR spectroscopy: Insights into Substrate Effects and Domain Adaptation</title><author>Arefi, Arman ; Babor, Majharulislam ; Liu, Shanghua ; Höhne, Marina M.-C. ; Sturm, Barbara ; Gómez, Pablo López ; Venus, Joachim ; Olszewska-Widdrat, Agata</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-datacite_primary_10_5281_zenodo_141714273</frbrgroupid><rsrctype>datasets</rsrctype><prefilter>datasets</prefilter><language>eng</language><creationdate>2024</creationdate><topic>Agricultural biotechnology</topic><topic>Bio-Waste</topic><topic>Bioprocessing technologies</topic><topic>Data Analysis</topic><topic>Deep learning</topic><topic>Domain Adaptation</topic><topic>Environmental biotechnology</topic><topic>Fermentation</topic><topic>FOS: Agricultural biotechnology</topic><topic>FOS: Environmental biotechnology</topic><topic>FOS: Industrial biotechnology</topic><topic>FTIR</topic><topic>Glucose</topic><topic>Industrial biotechnology</topic><topic>Lactic Acid</topic><topic>Machine Learning</topic><topic>Out-of-Distribution</topic><topic>Polylactic acid</topic><topic>Process Monitoring</topic><topic>Regression</topic><topic>Spectra</topic><topic>Spectroscopy</topic><topic>Transfer learning</topic><topic>Unsupervised learning</topic><topic>Waste treatment processes</topic><toplevel>online_resources</toplevel><creatorcontrib>Arefi, Arman</creatorcontrib><creatorcontrib>Babor, Majharulislam</creatorcontrib><creatorcontrib>Liu, Shanghua</creatorcontrib><creatorcontrib>Höhne, Marina 
M.-C.</creatorcontrib><creatorcontrib>Sturm, Barbara</creatorcontrib><creatorcontrib>Gómez, Pablo López</creatorcontrib><creatorcontrib>Venus, Joachim</creatorcontrib><creatorcontrib>Olszewska-Widdrat, Agata</creatorcontrib><collection>DataCite (Open Access)</collection><collection>DataCite</collection></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext_linktorsrc</fulltext></delivery><addata><au>Arefi, Arman</au><au>Babor, Majharulislam</au><au>Liu, Shanghua</au><au>Höhne, Marina M.-C.</au><au>Sturm, Barbara</au><au>Gómez, Pablo López</au><au>Venus, Joachim</au><au>Olszewska-Widdrat, Agata</au><format>book</format><genre>unknown</genre><ristype>DATA</ristype><title>AdaptFerm: Bioprocess Monitoring Using FTIR spectroscopy: Insights into Substrate Effects and Domain Adaptation</title><date>2024-11-15</date><risdate>2024</risdate><abstract>  1. Introduction The AdaptFerm dataset is designed to support the development of a monitoring framework for lactic acid production fermentation using Fourier Transform Infrared (FTIR) spectroscopy. Its primary goal is to facilitate the control strategies for continuous fermentation processes to maximize the lactic acid production. The AdaptFerm encompasses data from two distinct batch fermentation environments: one employing simple sugar (glucose) as the substrate and the other utilizing complex sugars derived from bio-waste. The study focuses on developing accurate predictive models for glucose and lactic acid concentrations, with an emphasis on applying classical machine learning techniques and enhancing domain generalization capabilities. 2. Prediction Model for Different Substrate Environments The chemical composition of the substrates is presented in Table 1 [1]. The dataset is utilized to train and test models within the same substrate domain. For instance, data from a single fermentation environment (e.g., glucose substrate) is used for both training and testing phases. 
The applied machine learning models showed accurate prediction within the same domain [1]. For more details on the methods applied, please refer to the following link: https://doi.org/10.1016/j.heliyon.2024.e38791. In this study, the MIR results correspond to the AdaptFerm dataset. The spectra of the glucose and biowaste hydrolysate fermentation process are presented in Figure 3 and Figure 4. 3. Domain Adaptation The dataset was also used to address the challenge posed by shifts in FTIR data when substrates change. Transitioning from simple sugar (glucose) to complex sugar (bio-waste) causes significant variations in the FTIR spectra, making it difficult for models trained on glucose fermentation data to maintain prediction accuracy in the complex sugar fermentation environment. This results in reduced robustness and performance when applied to out-of-distribution data. To address these challenges, we explore methods that improve the generalization ability and robustness of models in such scenarios without using labels from complex sugar fermentation [2]. It shows the application of machine learning interpretation to find domain invariant features for glucose and lactic acid. For more details on the methods applied, please refer to the following link: https://dx.doi.org/10.2139/ssrn.5012080. The code is available at https://github.com/shl-shawn/ShapFS. 4. Real-World Use Cases 4.1. Regression Task AdaptFerm serves as a benchmark for machine learning model applications in fermentation processes, specifically for predicting glucose and lactic acid concentrations, measured in g/L (grams per liter), while considering issues of out-of-distribution generalization. 4.2. Domain Adaptation Regression Task The dataset is also suitable for evaluating different  domain adaptation methods. In particular, the glucose substrate fermentation data can be used as the source domain, while the complex sugar fermentation data from bio-waste serves as the target domain. 
For semi-supervised domain adaptation approaches, it is recommended to use the initial data points (i.e., those collected at the beginning of the fermentation process) from the target domain as the labeled target samples, since the dataset is organized chronologically by collection day. These approaches aim to improve model robustness by transferring knowledge across domains and mitigating the effects of out-of-distribution data.

4.3. Anomaly Detection
The dataset can be used to train anomaly detection models that identify outliers or deviations from normal fermentation behavior. This could be valuable in industrial bioprocessing, where early detection of issues such as contamination or process failure is crucial. Techniques such as Isolation Forests, One-Class SVM, or Autoencoders could be applied to identify unusual patterns in FTIR spectra.

4.4. Classification Task
Although the main task is regression, the dataset could also be used for classification by discretizing the glucose and lactic acid concentrations into categories (e.g., low, medium, high). This would allow classification algorithms such as Support Vector Machines (SVM), Random Forests, or Neural Networks to predict the fermentation phase or identify specific operational conditions.

4.5. Transfer Learning
Given the domain adaptation setting of this dataset, transfer learning models can be explored. Models pre-trained on glucose fermentation data can be fine-tuned on complex sugar fermentation data, enabling quicker model convergence and improved performance in data-scarce environments.

4.6. Multi-Task Learning
In a multi-task learning scenario, models could predict glucose and lactic acid concentrations simultaneously from the same FTIR data. This could improve model accuracy by leveraging shared representations across the two tasks.

4.7. Feature Selection
The FTIR spectral data contains a large number of features (wavenumbers). Feature selection techniques such as Recursive Feature Elimination (RFE), Lasso regression, or mutual information could be applied to identify the wavenumbers most relevant for predicting glucose and lactic acid concentrations, improving model performance and interpretability.

5. Dataset Structure and Meta Information
The dataset is organized into four Excel files, corresponding to two main fermentation domains (different substrates) and two key process variables:
a) Simple Sugar Substrate: This domain contains data for the fermentation process using glucose as the substrate to produce lactic acid. It includes two files, one for glucose concentrations and one for lactic acid concentrations. Both are measured in g/L.
b) Complex Sugar Substrate: This domain contains data for the fermentation process using bio-waste as the substrate to produce lactic acid. Like the previous domain, it includes two files, one for glucose concentrations and one for lactic acid concentrations. Both are measured in g/L.
Each file is structured as follows: The first column contains the sample ID, which serves as the timeline of measurements (Sample ID 1 represents the first measurement in the fermentation process). From the second column onwards, the FTIR data is provided, covering the spectral range from 549.6 cm⁻¹ to 3999.6 cm⁻¹ and comprising 3,579 features. The final column contains the ground truth data, i.e., the chemical measurement of the fermentation variable (glucose or lactic acid concentration) in g/L.

6. Conclusion
AdaptFerm features FTIR spectral data from two distinct fermentation environments: simple sugar (glucose) and complex sugar (bio-waste).
The dataset is designed for regression tasks, including domain adaptation, and can be applied in machine learning model development for fermentation process monitoring, with a focus on enhancing model robustness and handling out-of-distribution data. It provides a resource for exploring domain shift and improving the robustness of machine learning models in bioengineering and fermentation processes, and it enables further research into domain generalization techniques.

References
[1] Arman Arefi, Barbara Sturm, Majharulislam Babor, Michael Horf, Thomas Hoffmann, Marina Höhne, Kathleen Friedrich, Linda Schroedter, Joachim Venus, Agata Olszewska-Widdrat, Digital model of biochemical reactions in lactic acid bacterial fermentation of simple glucose and biowaste substrates, Heliyon, Volume 10, Issue 19, 2024, e38791, ISSN 2405-8440, DOI: 10.1016/j.heliyon.2024.e38791, https://doi.org/10.1016/j.heliyon.2024.e38791.
[2] Majharulislam Babor, Shanghua Liu, Arman Arefi, Agata Olszewska-Widdrat, Barbara Sturm, Joachim Venus, and Marina M.-C. Höhne, Domain-Invariant Monitoring for Lactic Acid Production: Transfer Learning from Glucose to Bio-Waste Using Machine Learning Interpretation. Available at http://dx.doi.org/10.2139/ssrn.5012080.</abstract><pub>Zenodo</pub><doi>10.5281/zenodo.14171427</doi><orcidid>https://orcid.org/0000-0001-7708-1783</orcidid><orcidid>https://orcid.org/0000-0002-5440-7573</orcidid><orcidid>https://orcid.org/0009-0009-9855-9040</orcidid><orcidid>https://orcid.org/0000-0002-7843-6846</orcidid><orcidid>https://orcid.org/0000-0002-3247-3019</orcidid><oa>free_for_read</oa></addata></record>
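The column layout given in the description (sample ID, then 3,579 FTIR features from 549.6 cm⁻¹ to 3999.6 cm⁻¹, then the measured concentration) might be handled as in the minimal sketch below. It assumes evenly spaced, ascending wavenumbers, which the description does not state, and the function names are hypothetical. Reading the Excel files themselves is left out because the file names are not given here.

```python
from typing import List, Sequence, Tuple

N_FEATURES = 3579               # number of FTIR features stated in the description
WN_MIN, WN_MAX = 549.6, 3999.6  # spectral range in cm^-1 stated in the description


def wavenumber_axis(n: int = N_FEATURES, lo: float = WN_MIN, hi: float = WN_MAX) -> List[float]:
    """Return n wavenumbers from lo to hi, ASSUMING uniform spacing."""
    step = (hi - lo) / (n - 1)
    return [lo + i * step for i in range(n)]


def split_row(row: Sequence[float]) -> Tuple[int, List[float], float]:
    """Split one record into (sample_id, spectrum, concentration in g/L)."""
    if len(row) != N_FEATURES + 2:
        raise ValueError(f"expected {N_FEATURES + 2} columns, got {len(row)}")
    return int(row[0]), list(row[1:-1]), float(row[-1])


# Synthetic row for illustration only: ID 1, a flat spectrum, 12.5 g/L.
example_row = [1] + [0.0] * N_FEATURES + [12.5]
sample_id, spectrum, concentration = split_row(example_row)
axis = wavenumber_axis()
```

Under the uniform-spacing assumption, `axis[i]` gives the wavenumber of `spectrum[i]`, which is useful when mapping selected features back to spectral bands.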
fulltext fulltext_linktorsrc
identifier DOI: 10.5281/zenodo.14171427
ispartof
issn
language eng
recordid cdi_datacite_primary_10_5281_zenodo_14171427
source DataCite
subjects Agricultural biotechnology
Bio-Waste
Bioprocessing technologies
Data Analysis
Deep learning
Domain Adaptation
Environmental biotechnology
Fermentation
FOS: Agricultural biotechnology
FOS: Environmental biotechnology
FOS: Industrial biotechnology
FTIR
Glucose
Industrial biotechnology
Lactic Acid
Machine Learning
Out-of-Distribution
Polylactic acid
Process Monitoring
Regression
Spectra
Spectroscopy
Transfer learning
Unsupervised learning
Waste treatment processes
title AdaptFerm: Bioprocess Monitoring Using FTIR spectroscopy: Insights into Substrate Effects and Domain Adaptation
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-02-07T22%3A57%3A08IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-datacite_PQ8&rft_val_fmt=info:ofi/fmt:kev:mtx:book&rft.genre=unknown&rft.au=Arefi,%20Arman&rft.date=2024-11-15&rft_id=info:doi/10.5281/zenodo.14171427&rft_dat=%3Cdatacite_PQ8%3E10_5281_zenodo_14171427%3C/datacite_PQ8%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_id=info:pmid/&rfr_iscdi=true