Enlarging Applicability Domain of Quantitative Structure–Activity Relationship Models through Uncertainty-Based Active Learning

The first step to develop a quantitative structure–activity relationship (QSAR) model is to identify a set of chemicals with known activities/properties, which can be either collected from the published studies or measured experimentally. A key challenge in this process is how to determine which che...

Detailed description

Bibliographic details
Published in:	ACS ES&T engineering 2022-07, Vol.2 (7), p.1211-1220
Main authors:	Zhong, Shifa; Lambeth, Dylan R; Igou, Thomas K; Chen, Yongsheng
Format:	Article
Language:	eng
Subjects:	data collection; normal distribution; prediction; quantitative structure-activity relationships; solubility; standard deviation; uncertainty
Online access:	Full text

container_end_page	1220
container_issue	7
container_start_page	1211
container_title	ACS ES&T engineering
container_volume	2
creator	Zhong, Shifa; Lambeth, Dylan R; Igou, Thomas K; Chen, Yongsheng
description	The first step to develop a quantitative structure–activity relationship (QSAR) model is to identify a set of chemicals with known activities/properties, which can be either collected from the published studies or measured experimentally. A key challenge in this process is how to determine which chemicals are used to train a QSAR model, and, of those chemicals, which should be prioritized in experimental trials to ensure that the obtained models have large applicability domains (ADs). In this study, we employ uncertainty-based active learning (AC) to address this challenge. We use the Gaussian process (GP) to develop QSAR models for three public datasets, Koc, solubility, and k•OH, each with a number of chemicals represented by molecular descriptors, in which the GP can offer prediction uncertainty (by means of standard deviation) for the model’s prediction. The training chemicals of each dataset are selected in two different ways: (1) random splitting (RS) and (2) uncertainty-based AC. Uncertainty-based AC iteratively identifies chemicals with the highest uncertainty and selects them for model training. We demonstrate that the chemicals selected by AC are more diverse than those selected by RS and that AC-based QSAR models have better generalizability than those derived from RS. We then use these two types of models to predict the properties of chemicals in the REACH dataset (>300,000 chemicals) and assess their ADs using five different AD determination methods. We demonstrate that the AD of AC-based QSAR models for all AD methods is significantly larger than that of RS-based models (up to 24 times larger). This study provides a novel method to enlarge the AD of QSAR models, which can guide model development and improve the property prediction reliability for more REACH dataset chemicals while minimizing the development cost and time.
doi_str_mv	10.1021/acsestengg.1c00434
format	Article
publisher	American Chemical Society
rights	2022 American Chemical Society
date	2022-07-08
eissn	2690-0645
tpages	10
orcid	https://orcid.org/0000-0002-5822-0837 ; https://orcid.org/0000-0002-9519-2302
peer_reviewed	true
linktohtml	https://pubs.acs.org/doi/10.1021/acsestengg.1c00434
linktopdf	https://pubs.acs.org/doi/pdf/10.1021/acsestengg.1c00434
fulltext	fulltext
identifier	ISSN: 2690-0645
ispartof	ACS ES&T engineering, 2022-07, Vol.2 (7), p.1211-1220
issn	2690-0645 2690-0645
language	eng
recordid	cdi_proquest_miscellaneous_2718263114
source	ACS Journals
subjects	data collection; normal distribution; prediction; quantitative structure-activity relationships; solubility; standard deviation; uncertainty
title	Enlarging Applicability Domain of Quantitative Structure–Activity Relationship Models through Uncertainty-Based Active Learning
url	https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-17T17%3A15%3A31IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_cross&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Enlarging%20Applicability%20Domain%20of%20Quantitative%20Structure%E2%80%93Activity%20Relationship%20Models%20through%20Uncertainty-Based%20Active%20Learning&rft.jtitle=ACS%20ES&T%20engineering&rft.au=Zhong,%20Shifa&rft.date=2022-07-08&rft.volume=2&rft.issue=7&rft.spage=1211&rft.epage=1220&rft.pages=1211-1220&rft.issn=2690-0645&rft.eissn=2690-0645&rft_id=info:doi/10.1021/acsestengg.1c00434&rft_dat=%3Cproquest_cross%3E2718263114%3C/proquest_cross%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2718263114&rft_id=info:pmid/&rfr_iscdi=true
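
The selection step summarized in the description (iteratively adding the chemical with the highest GP predictive standard deviation to the training set) can be illustrated with a minimal sketch. This is not the authors' implementation: the record does not state the kernel, descriptors, batch size, or how hyperparameters were fitted. The function names (`rbf`, `gp_std`, `active_select`), the squared-exponential kernel, the fixed length scale and noise term, and the toy two-dimensional "descriptor" pool are all assumptions made for illustration. Note that with fixed kernel hyperparameters, as assumed here, the GP posterior standard deviation depends only on the descriptors, not on the measured property values; a fuller version would presumably refit the model after each new measurement.

```python
import math
import random


def rbf(a, b, length=1.0):
    """Squared-exponential kernel between two descriptor vectors (assumed kernel)."""
    d2 = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-0.5 * d2 / length ** 2)


def cholesky(A):
    """Lower-triangular Cholesky factor of a symmetric positive-definite matrix."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = A[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    return L


def forward_solve(L, b):
    """Solve L z = b for lower-triangular L."""
    z = []
    for i, row in enumerate(L):
        z.append((b[i] - sum(row[k] * z[k] for k in range(i))) / row[i])
    return z


def gp_std(X_train, x, noise=1e-4):
    """Posterior standard deviation of a zero-mean GP at x, given training inputs."""
    K = [[rbf(p, q) + (noise if i == j else 0.0)
          for j, q in enumerate(X_train)]
         for i, p in enumerate(X_train)]
    k_star = [rbf(p, x) for p in X_train]
    v = forward_solve(cholesky(K), k_star)
    # var = k(x, x) - k*^T K^-1 k*, computed via the Cholesky factor
    var = rbf(x, x) - sum(t * t for t in v)
    return math.sqrt(max(var, 0.0))


def active_select(pool, n_select, seed_index=0):
    """Greedily pick pool indices with the highest predictive std, one at a time."""
    chosen = [seed_index]
    while len(chosen) < n_select:
        train = [pool[i] for i in chosen]
        candidates = [i for i in range(len(pool)) if i not in chosen]
        best = max(candidates, key=lambda i: gp_std(train, pool[i]))
        chosen.append(best)
    return chosen


if __name__ == "__main__":
    # Toy pool of 30 hypothetical chemicals, each with two made-up descriptors.
    rng = random.Random(0)
    pool = [[rng.uniform(0, 5), rng.uniform(0, 5)] for _ in range(30)]
    print("selected for measurement:", active_select(pool, n_select=6))
```

A random-splitting baseline would instead draw the same number of indices with `random.sample`; comparing the spread of the two selections in descriptor space is one simple way to see why uncertainty-driven selection tends to yield a more diverse training set.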