ARKA: a framework of dimensionality reduction for machine-learning classification modeling, risk assessment, and data gap-filling of sparse environmental toxicity data

Due to the lack of experimental toxicity data for environmental chemicals, there arises a need to fill data gaps by in silico approaches. One of the most commonly used in silico approaches for toxicity assessment of small datasets is the Quantitative Structure-Activity Relationship (QSAR), which gen...

Detailed description

Bibliographic details
Published in:	Environmental science--processes & impacts 2024-06, Vol.26 (6), p.991-17
Main authors:	Banerjee, Arkaprava, Roy, Kunal
Format:	Article
Language:	eng
Subjects:	Algae; Animals; Carcinogenicity; Carcinogens; Classification; Data points; Datasets; Environmental Pollutants - chemistry; Environmental Pollutants - toxicity; Expert systems; Machine Learning; Modelling; Partitioning; Prediction models; Predictions; Quantitative Structure-Activity Relationship; Risk assessment; Risk Assessment - methods; Structure-activity relationships; Toxicity; Toxicity Tests
Online access:	Full text

container_end_page	17
container_issue	6
container_start_page	991
container_title	Environmental science--processes & impacts
container_volume	26
creator	Banerjee, Arkaprava Roy, Kunal
description	Because experimental toxicity data are lacking for many environmental chemicals, data gaps need to be filled by in silico approaches. One of the most commonly used in silico approaches for toxicity assessment of small datasets is the Quantitative Structure-Activity Relationship (QSAR), which generates predictive models for the efficient prediction of query compounds. However, the reliability of predictions from QSARs derived from small datasets is often questionable from a statistical point of view. This is because the number of descriptors is large relative to the number of training compounds, which reduces the degrees of freedom of the developed model. To reduce the overall prediction error for a particular QSAR model, we propose here the computation of the novel Arithmetic Residuals in K-groups Analysis (ARKA) descriptors. We reduce the number of modeling descriptors in a supervised manner by partitioning them into K classes (K = 2 here) according to the response class for which each descriptor has the higher mean normalized value, thus preventing the loss of chemical information. A scatter plot of the data points using the values of two ARKA descriptors (ARKA_2 vs. ARKA_1) can potentially identify activity cliffs, less confident data points, and less modelable data points. We applied the ARKA framework for classification modeling to five representative environmentally relevant endpoints (skin sensitization, earthworm toxicity, milk/plasma partitioning, algal toxicity, and rodent carcinogenicity of hazardous chemicals) with graded responses. On comparing the performance of the models generated using conventional QSAR descriptors and the ARKA descriptors, the prediction quality of the models derived from ARKA descriptors was found, based on decision criteria derived from multiple graded-data validation metrics, to be much better than that of the models derived from QSAR descriptors, signifying the potential of ARKA descriptors in ecotoxicological classification modeling of small datasets. This also holds for the Read-Across approach, since Read-Across predictions using ARKA descriptors outperform those generated from QSAR descriptors. For the ease of users, a Java-based expert system has been developed that computes the ARKA descriptors from the input of QSAR descriptors.
doi_str_mv	10.1039/d4em00173g
format	Article
fullrecord	<record><control><sourceid>proquest_rsc_p</sourceid><recordid>TN_cdi_rsc_primary_d4em00173g</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><sourcerecordid>3069495084</sourcerecordid><originalsourceid>FETCH-LOGICAL-c373t-7301245c2bba68644d83cc6138404b1dec760140f62a92124f58d9b17b09849b3</originalsourceid><addsrcrecordid>eNpd0ktv1DAQAGALUdGq7YU7yBIXhBqwYztxuK1KH4iiShWco4kfi1vHXuwE6C_ib-J0yyLhi1-fZ6QZI_SckreUsO6d5mYkhLZs_QQd1ESQqpWdeLpby3YfHed8S8qQgkrRPEP7TLacEcEP0O_VzafVewzYJhjNz5jucLRYu9GE7GIA76Z7nIye1VS22MaER1DfXDCVN5CCC2usPOTsrFPwYMaojS_nJzi5fIfLncm5xJtOMASNNUyA17CprPMLW_LlDaRssAk_XIphseDxFH85taRfXhyhPQs-m-PH-RB9PT_7cnpZXV1ffDxdXVWKtWyqWkZozYWqhwEa2XCuJVOqoUxywgeqjWobQjmxTQ1dXagVUncDbQfSSd4N7BC93sbdpPh9NnnqR5eV8R6CiXPuS9UEFzXldaGv_qO3cU6lZItqOt4JInlRb7ZKpZhzMrbfJDdCuu8p6ZcO9h_42eeHDl4U_PIx5DyMRu_o334V8GILUla7239fgP0B7sahZg</addsrcrecordid><sourcetype>Aggregation Database</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>3069495084</pqid></control><display><type>article</type><title>ARKA: a framework of dimensionality reduction for machine-learning classification modeling, risk assessment, and data gap-filling of sparse environmental toxicity data</title><source>MEDLINE</source><source>Royal Society Of Chemistry Journals 2008-</source><creator>Banerjee, Arkaprava ; Roy, Kunal</creator><creatorcontrib>Banerjee, Arkaprava ; Roy, Kunal</creatorcontrib><description>Due to the lack of experimental toxicity data for environmental chemicals, there arises a need to fill data gaps by in silico approaches. One of the most commonly used in silico approaches for toxicity assessment of small datasets is the Quantitative Structure-Activity Relationship (QSAR), which generates predictive models for the efficient prediction of query compounds. However, the reliability of the predictions from QSARs derived from small datasets is often questionable from a statistical point of view. 
This is due to the presence of a larger number of descriptors as compared to the number of training compounds, which reduces the degree of freedom of the developed model. To reduce the overall prediction error for a particular QSAR model, we have proposed here the computation of the novel Arithmetic Residuals in K -groups Analysis (ARKA) descriptors. We have reduced the number of modeling descriptors in a supervised manner by partitioning them into K classes ( K = 2 here) depending on the higher mean normalized values of the descriptors to a particular response class, thus preventing the loss of chemical information. A scatter plot of the data points using the values of two ARKA descriptors (ARKA_2 vs. ARKA_1) can potentially identify activity cliffs, less confident data points, and less modelable data points. We have used here five representative environmentally relevant endpoints (skin sensitization, earthworm toxicity, milk/plasma partitioning, algal toxicity, and rodent carcinogenicity of hazardous chemicals) with graded responses to which the ARKA framework was applied for classification modeling. On comparing the performance of the models generated using conventional QSAR descriptors and the ARKA descriptors, the prediction quality of the models derived from ARKA descriptors was found, based on multiple graded-data validation metrics-derived decision criteria, much better than the models derived from QSAR descriptors signifying the potential of ARKA descriptors in ecotoxicological classification modeling of small data sets. Additionally, this holds true for the Read-Across approach as well, since the Read-Across predictions using ARKA descriptors supersede the predictions generated from QSAR descriptors. For the ease of users, a Java-based expert system has been developed that computes the ARKA descriptors from the input of QSAR descriptors. 
A scatter plot of the data points using the values of two ARKA descriptors can potentially identify activity cliffs, less confident data points, and less modelable data points.</description><identifier>ISSN: 2050-7887</identifier><identifier>ISSN: 2050-7895</identifier><identifier>EISSN: 2050-7895</identifier><identifier>DOI: 10.1039/d4em00173g</identifier><identifier>PMID: 38743054</identifier><language>eng</language><publisher>England: Royal Society of Chemistry</publisher><subject>Algae ; Animals ; Carcinogenicity ; Carcinogens ; Classification ; Data points ; Datasets ; Environmental Pollutants - chemistry ; Environmental Pollutants - toxicity ; Expert systems ; Machine Learning ; Modelling ; Partitioning ; Prediction models ; Predictions ; Quantitative Structure-Activity Relationship ; Risk assessment ; Risk Assessment - methods ; Structure-activity relationships ; Toxicity ; Toxicity Tests</subject><ispartof>Environmental science--processes & impacts, 2024-06, Vol.26 (6), p.991-17</ispartof><rights>Copyright Royal Society of Chemistry 2024</rights><lds50>peer_reviewed</lds50><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed><citedby>FETCH-LOGICAL-c373t-7301245c2bba68644d83cc6138404b1dec760140f62a92124f58d9b17b09849b3</citedby><cites>FETCH-LOGICAL-c373t-7301245c2bba68644d83cc6138404b1dec760140f62a92124f58d9b17b09849b3</cites><orcidid>0000-0003-4486-8074 ; 0000-0001-8468-0784</orcidid></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><link.rule.ids>314,780,784,27924,27925</link.rule.ids><backlink>$$Uhttps://www.ncbi.nlm.nih.gov/pubmed/38743054$$D View this record in MEDLINE/PubMed$$Hfree_for_read</backlink></links><search><creatorcontrib>Banerjee, Arkaprava</creatorcontrib><creatorcontrib>Roy, Kunal</creatorcontrib><title>ARKA: a framework of dimensionality reduction for machine-learning classification modeling, risk 
assessment, and data gap-filling of sparse environmental toxicity data</title><title>Environmental science--processes & impacts</title><addtitle>Environ Sci Process Impacts</addtitle><description>Due to the lack of experimental toxicity data for environmental chemicals, there arises a need to fill data gaps by in silico approaches. One of the most commonly used in silico approaches for toxicity assessment of small datasets is the Quantitative Structure-Activity Relationship (QSAR), which generates predictive models for the efficient prediction of query compounds. However, the reliability of the predictions from QSARs derived from small datasets is often questionable from a statistical point of view. This is due to the presence of a larger number of descriptors as compared to the number of training compounds, which reduces the degree of freedom of the developed model. To reduce the overall prediction error for a particular QSAR model, we have proposed here the computation of the novel Arithmetic Residuals in K -groups Analysis (ARKA) descriptors. We have reduced the number of modeling descriptors in a supervised manner by partitioning them into K classes ( K = 2 here) depending on the higher mean normalized values of the descriptors to a particular response class, thus preventing the loss of chemical information. A scatter plot of the data points using the values of two ARKA descriptors (ARKA_2 vs. ARKA_1) can potentially identify activity cliffs, less confident data points, and less modelable data points. We have used here five representative environmentally relevant endpoints (skin sensitization, earthworm toxicity, milk/plasma partitioning, algal toxicity, and rodent carcinogenicity of hazardous chemicals) with graded responses to which the ARKA framework was applied for classification modeling. 
On comparing the performance of the models generated using conventional QSAR descriptors and the ARKA descriptors, the prediction quality of the models derived from ARKA descriptors was found, based on multiple graded-data validation metrics-derived decision criteria, much better than the models derived from QSAR descriptors signifying the potential of ARKA descriptors in ecotoxicological classification modeling of small data sets. Additionally, this holds true for the Read-Across approach as well, since the Read-Across predictions using ARKA descriptors supersede the predictions generated from QSAR descriptors. For the ease of users, a Java-based expert system has been developed that computes the ARKA descriptors from the input of QSAR descriptors. A scatter plot of the data points using the values of two ARKA descriptors can potentially identify activity cliffs, less confident data points, and less modelable data points.</description><subject>Algae</subject><subject>Animals</subject><subject>Carcinogenicity</subject><subject>Carcinogens</subject><subject>Classification</subject><subject>Data points</subject><subject>Datasets</subject><subject>Environmental Pollutants - chemistry</subject><subject>Environmental Pollutants - toxicity</subject><subject>Expert systems</subject><subject>Machine Learning</subject><subject>Modelling</subject><subject>Partitioning</subject><subject>Prediction models</subject><subject>Predictions</subject><subject>Quantitative Structure-Activity Relationship</subject><subject>Risk assessment</subject><subject>Risk Assessment - methods</subject><subject>Structure-activity relationships</subject><subject>Toxicity</subject><subject>Toxicity 
Tests</subject><issn>2050-7887</issn><issn>2050-7895</issn><issn>2050-7895</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2024</creationdate><recordtype>article</recordtype><sourceid>EIF</sourceid><recordid>eNpd0ktv1DAQAGALUdGq7YU7yBIXhBqwYztxuK1KH4iiShWco4kfi1vHXuwE6C_ib-J0yyLhi1-fZ6QZI_SckreUsO6d5mYkhLZs_QQd1ESQqpWdeLpby3YfHed8S8qQgkrRPEP7TLacEcEP0O_VzafVewzYJhjNz5jucLRYu9GE7GIA76Z7nIye1VS22MaER1DfXDCVN5CCC2usPOTsrFPwYMaojS_nJzi5fIfLncm5xJtOMASNNUyA17CprPMLW_LlDaRssAk_XIphseDxFH85taRfXhyhPQs-m-PH-RB9PT_7cnpZXV1ffDxdXVWKtWyqWkZozYWqhwEa2XCuJVOqoUxywgeqjWobQjmxTQ1dXagVUncDbQfSSd4N7BC93sbdpPh9NnnqR5eV8R6CiXPuS9UEFzXldaGv_qO3cU6lZItqOt4JInlRb7ZKpZhzMrbfJDdCuu8p6ZcO9h_42eeHDl4U_PIx5DyMRu_o334V8GILUla7239fgP0B7sahZg</recordid><startdate>20240619</startdate><enddate>20240619</enddate><creator>Banerjee, Arkaprava</creator><creator>Roy, Kunal</creator><general>Royal Society of Chemistry</general><scope>CGR</scope><scope>CUY</scope><scope>CVF</scope><scope>ECM</scope><scope>EIF</scope><scope>NPM</scope><scope>AAYXX</scope><scope>CITATION</scope><scope>7ST</scope><scope>C1K</scope><scope>SOI</scope><scope>7X8</scope><orcidid>https://orcid.org/0000-0003-4486-8074</orcidid><orcidid>https://orcid.org/0000-0001-8468-0784</orcidid></search><sort><creationdate>20240619</creationdate><title>ARKA: a framework of dimensionality reduction for machine-learning classification modeling, risk assessment, and data gap-filling of sparse environmental toxicity data</title><author>Banerjee, Arkaprava ; Roy, Kunal</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-c373t-7301245c2bba68644d83cc6138404b1dec760140f62a92124f58d9b17b09849b3</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2024</creationdate><topic>Algae</topic><topic>Animals</topic><topic>Carcinogenicity</topic><topic>Carcinogens</topic><topic>Classification</topic><topic>Data 
points</topic><topic>Datasets</topic><topic>Environmental Pollutants - chemistry</topic><topic>Environmental Pollutants - toxicity</topic><topic>Expert systems</topic><topic>Machine Learning</topic><topic>Modelling</topic><topic>Partitioning</topic><topic>Prediction models</topic><topic>Predictions</topic><topic>Quantitative Structure-Activity Relationship</topic><topic>Risk assessment</topic><topic>Risk Assessment - methods</topic><topic>Structure-activity relationships</topic><topic>Toxicity</topic><topic>Toxicity Tests</topic><toplevel>peer_reviewed</toplevel><toplevel>online_resources</toplevel><creatorcontrib>Banerjee, Arkaprava</creatorcontrib><creatorcontrib>Roy, Kunal</creatorcontrib><collection>Medline</collection><collection>MEDLINE</collection><collection>MEDLINE (Ovid)</collection><collection>MEDLINE</collection><collection>MEDLINE</collection><collection>PubMed</collection><collection>CrossRef</collection><collection>Environment Abstracts</collection><collection>Environmental Sciences and Pollution Management</collection><collection>Environment Abstracts</collection><collection>MEDLINE - Academic</collection><jtitle>Environmental science--processes & impacts</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Banerjee, Arkaprava</au><au>Roy, Kunal</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>ARKA: a framework of dimensionality reduction for machine-learning classification modeling, risk assessment, and data gap-filling of sparse environmental toxicity data</atitle><jtitle>Environmental science--processes & impacts</jtitle><addtitle>Environ Sci Process Impacts</addtitle><date>2024-06-19</date><risdate>2024</risdate><volume>26</volume><issue>6</issue><spage>991</spage><epage>17</epage><pages>991-17</pages><issn>2050-7887</issn><issn>2050-7895</issn><eissn>2050-7895</eissn><abstract>Due to the lack of experimental toxicity data for environmental 
chemicals, there arises a need to fill data gaps by in silico approaches. One of the most commonly used in silico approaches for toxicity assessment of small datasets is the Quantitative Structure-Activity Relationship (QSAR), which generates predictive models for the efficient prediction of query compounds. However, the reliability of the predictions from QSARs derived from small datasets is often questionable from a statistical point of view. This is due to the presence of a larger number of descriptors as compared to the number of training compounds, which reduces the degree of freedom of the developed model. To reduce the overall prediction error for a particular QSAR model, we have proposed here the computation of the novel Arithmetic Residuals in K -groups Analysis (ARKA) descriptors. We have reduced the number of modeling descriptors in a supervised manner by partitioning them into K classes ( K = 2 here) depending on the higher mean normalized values of the descriptors to a particular response class, thus preventing the loss of chemical information. A scatter plot of the data points using the values of two ARKA descriptors (ARKA_2 vs. ARKA_1) can potentially identify activity cliffs, less confident data points, and less modelable data points. We have used here five representative environmentally relevant endpoints (skin sensitization, earthworm toxicity, milk/plasma partitioning, algal toxicity, and rodent carcinogenicity of hazardous chemicals) with graded responses to which the ARKA framework was applied for classification modeling. 
On comparing the performance of the models generated using conventional QSAR descriptors and the ARKA descriptors, the prediction quality of the models derived from ARKA descriptors was found, based on multiple graded-data validation metrics-derived decision criteria, much better than the models derived from QSAR descriptors signifying the potential of ARKA descriptors in ecotoxicological classification modeling of small data sets. Additionally, this holds true for the Read-Across approach as well, since the Read-Across predictions using ARKA descriptors supersede the predictions generated from QSAR descriptors. For the ease of users, a Java-based expert system has been developed that computes the ARKA descriptors from the input of QSAR descriptors. A scatter plot of the data points using the values of two ARKA descriptors can potentially identify activity cliffs, less confident data points, and less modelable data points.</abstract><cop>England</cop><pub>Royal Society of Chemistry</pub><pmid>38743054</pmid><doi>10.1039/d4em00173g</doi><tpages>17</tpages><orcidid>https://orcid.org/0000-0003-4486-8074</orcidid><orcidid>https://orcid.org/0000-0001-8468-0784</orcidid><oa>free_for_read</oa></addata></record>
fulltext	fulltext
identifier	ISSN: 2050-7887
ispartof	Environmental science--processes & impacts, 2024-06, Vol.26 (6), p.991-17
issn	2050-7887 2050-7895 2050-7895
language	eng
recordid	cdi_rsc_primary_d4em00173g
source	MEDLINE; Royal Society Of Chemistry Journals 2008-
subjects	Algae; Animals; Carcinogenicity; Carcinogens; Classification; Data points; Datasets; Environmental Pollutants - chemistry; Environmental Pollutants - toxicity; Expert systems; Machine Learning; Modelling; Partitioning; Prediction models; Predictions; Quantitative Structure-Activity Relationship; Risk assessment; Risk Assessment - methods; Structure-activity relationships; Toxicity; Toxicity Tests
title	ARKA: a framework of dimensionality reduction for machine-learning classification modeling, risk assessment, and data gap-filling of sparse environmental toxicity data
url	https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-01T07%3A05%3A40IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_rsc_p&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=ARKA:%20a%20framework%20of%20dimensionality%20reduction%20for%20machine-learning%20classification%20modeling,%20risk%20assessment,%20and%20data%20gap-filling%20of%20sparse%20environmental%20toxicity%20data&rft.jtitle=Environmental%20science--processes%20&%20impacts&rft.au=Banerjee,%20Arkaprava&rft.date=2024-06-19&rft.volume=26&rft.issue=6&rft.spage=991&rft.epage=17&rft.pages=991-17&rft.issn=2050-7887&rft.eissn=2050-7895&rft_id=info:doi/10.1039/d4em00173g&rft_dat=%3Cproquest_rsc_p%3E3069495084%3C/proquest_rsc_p%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=3069495084&rft_id=info:pmid/38743054&rfr_iscdi=true
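
Illustrative sketch (not part of the catalog record): the description above says that descriptors are partitioned, in a supervised manner, into K = 2 classes according to the response class for which each descriptor has the higher mean normalized value. The Python sketch below shows one way that partitioning step might look. It rests on several assumptions that the record does not confirm: min-max scaling as the normalization, binary labels 0/1, ties assigned to class 1, and a plain per-group sum as the aggregation. The function names are hypothetical, and the per-group sum is only a placeholder, not the authors' ARKA formula, which the record does not give.

```python
from statistics import mean


def min_max_normalize(column):
    """Scale one descriptor column to [0, 1] (assumed normalization)."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # A constant descriptor carries no information; map it to zeros.
        return [0.0 for _ in column]
    return [(v - lo) / (hi - lo) for v in column]


def partition_descriptors(rows, labels):
    """Assign each descriptor to the class with the higher mean normalized value.

    rows   : list of compounds, each a list of descriptor values
    labels : list of 0/1 response classes, one per compound
    Returns (normalized_columns, groups), where groups[j] is 1 or 0.
    """
    n_desc = len(rows[0])
    columns = [min_max_normalize([r[j] for r in rows]) for j in range(n_desc)]
    groups = []
    for col in columns:
        mean_pos = mean(v for v, y in zip(col, labels) if y == 1)
        mean_neg = mean(v for v, y in zip(col, labels) if y == 0)
        # Ties go to class 1 here; this tie rule is an assumption.
        groups.append(1 if mean_pos >= mean_neg else 0)
    return columns, groups


def group_sums(columns, groups):
    """Placeholder aggregation: per-compound sum of normalized values per group.

    This is NOT the published ARKA descriptor formula, only a stand-in that
    yields two values per compound suitable for a two-axis scatter plot.
    """
    n_compounds = len(columns[0])
    out = []
    for i in range(n_compounds):
        s1 = sum(col[i] for col, g in zip(columns, groups) if g == 1)
        s0 = sum(col[i] for col, g in zip(columns, groups) if g == 0)
        out.append((s1, s0))
    return out
```

With a toy input of four compounds and two descriptors, `partition_descriptors` places the first descriptor in group 1 and the second in group 0 when the positives have high values of the first and low values of the second; plotting the two sums against each other then gives a two-dimensional view of the data in the spirit of the ARKA_2 vs. ARKA_1 scatter plot mentioned in the description.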