A New Efficient Approach for Variable Selection Based on Multiregression: Prediction of Gas Chromatographic Retention Times and Response Factors

The selection of the most relevant variable is a frequent problem in the analysis of chemical data, especially now considering the large amounts of data created by the increased computer power and analytical resolution. A novel procedure for variable selection based on multiregression (MR) analysis...

Bibliographic details
Published in: Journal of Chemical Information and Computer Sciences 1999-05, Vol.39 (3), p.610-621
Main authors: Lučić, Bono; Trinajstić, Nenad; Sild, Sulev; Karelson, Mati; Katritzky, Alan R
Format: Article
Language: eng
Online access: Full text
container_end_page 621
container_issue 3
container_start_page 610
container_title Journal of Chemical Information and Computer Sciences
container_volume 39
creator Lučić, Bono
Trinajstić, Nenad
Sild, Sulev
Karelson, Mati
Katritzky, Alan R
description The selection of the most relevant variable is a frequent problem in the analysis of chemical data, especially now considering the large amounts of data created by the increased computer power and analytical resolution. A novel procedure for variable selection based on multiregression (MR) analysis is developed and applied to the quantitative structure−property relationship (QSPR) modeling of gas chromatographic retention times t_R and Dietz response factors RF on 152 diverse chemical compounds. Using 296 descriptors generated by the CODESSA program, “absolutely the best” linear MR models containing from 1 to 5 descriptors were first selected (∼2 × 10^10 models were checked), and then “the best” linear stepwise MR models with six and seven descriptors were obtained through “i by i” stepwise selection. In this paper i was varied from 1 to 4, so that in each next step i descriptors were added to the previously selected descriptors. Nonlinear models were developed by the inclusion of cross-products of initial descriptors. We selected as the most important descriptors for t_R the number of C−H and C−X bonds, connectivity indices of order 3, the highest normal mode vibrational frequency, and the rotational entropy of the molecule at 300 K. In the case of RF modeling the most important descriptors are those related to the relative number and weight of effective C atoms, the orbital electronic population, and the bond order and valency of C and H atoms. Comparison with the best six-descriptor models obtained by the normal CODESSA procedure shows that nonlinear seven-descriptor MR models now obtained achieve 30% (0.3520 vs 0.5032) and 12% (0.0472 vs 0.0530) less standard errors of estimate for t_R and RF, respectively. Our novel procedure of selecting a small number of the most important descriptors from a data set allows us to extract a larger amount of useful information than with the procedure implemented in CODESSA. Thus, our new procedure enables the selection of the best possible MR models from 10^10 possibilities. Through the introduction of cross-product terms, we obtained nonlinear MR models which are superior to the corresponding linear models.
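The selection scheme summarized in the description can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`n_subsets`, `std_error`, `best_subset`, `extend_i_by_i`, `with_cross_products`) are hypothetical, ordinary least squares with an intercept is assumed as the MR fit, the standard error of estimate is assumed as the ranking criterion, and cross-products are assumed to mean pairwise products of distinct descriptors. A brute-force loop like this is only practical for small descriptor sets; the record does not say how the original search over roughly 2 × 10^10 models was made efficient.

```python
from itertools import combinations
from math import comb

import numpy as np


def n_subsets(n_descriptors, max_size):
    """Count descriptor subsets of size 1..max_size."""
    return sum(comb(n_descriptors, k) for k in range(1, max_size + 1))


def std_error(X, y, cols):
    """Standard error of estimate of an OLS fit (with intercept) on the given columns."""
    A = np.column_stack([np.ones(len(y)), X[:, cols]])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    dof = len(y) - A.shape[1]  # observations minus fitted parameters
    return float(np.sqrt(resid @ resid / dof))


def best_subset(X, y, size):
    """Exhaustive search: the subset of `size` columns with the lowest standard error."""
    return min(combinations(range(X.shape[1]), size),
               key=lambda cols: std_error(X, y, list(cols)))


def extend_i_by_i(X, y, selected, i):
    """Stepwise step: add the best group of i new columns to the selected ones."""
    remaining = [c for c in range(X.shape[1]) if c not in selected]
    extra = min(combinations(remaining, i),
                key=lambda grp: std_error(X, y, list(selected) + list(grp)))
    return tuple(selected) + extra


def with_cross_products(X):
    """Append pairwise products x_j * x_k (j < k) as candidate nonlinear terms (assumed form)."""
    pairs = combinations(range(X.shape[1]), 2)
    products = np.column_stack([X[:, j] * X[:, k] for j, k in pairs])
    return np.hstack([X, products])


# Synthetic illustration only; these are not data from the paper.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 8))
y = 2.0 * X[:, 1] - 3.0 * X[:, 5] + 0.01 * rng.normal(size=40)

core = best_subset(X, y, 2)               # exhaustive stage
model = extend_i_by_i(X, y, core, i=1)    # one "i by i" step with i = 1
print(core, model, n_subsets(296, 5))
```

Counting the subsets of 1 to 5 descriptors out of 296 with `n_subsets(296, 5)` gives a figure on the order of 2 × 10^10, consistent with the number of models quoted in the description.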
doi_str_mv 10.1021/ci980161a
format Article
fullrecord <record><control><sourceid>istex_cross</sourceid><recordid>TN_cdi_crossref_primary_10_1021_ci980161a</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><sourcerecordid>ark_67375_TPS_1TKSB7M2_N</sourcerecordid><originalsourceid>FETCH-LOGICAL-a295t-423483b92b0c74184a2f13a5a95689ebb3c6af6ce181b96e0e05172f740d878c3</originalsourceid><addsrcrecordid>eNpt0DtPAzEMAOAIgUR5DPyDLAwMB3ncK2yl4iUoIFoQW-RLHRpoL6fkKmBjZeYf8ks4VMTEZMv-ZMsmZIezfc4EPzBOlYznHFZIj2epSlTOHlZJjzGVJULKcp1sxPjEmJQqFz3y2adX-EKPrXXGYd3SftMED2ZKrQ_0HoKDaoZ0hDM0rfM1PYKIE9olw8WsdQEfA8bYNQ6_3j_oTcCJWzpv6SlEOpgGP4fWPwZops7QW2y7LT9g7OYYKdSTrhYbX0ekJ2BaH-IWWbMwi7j9GzfJ3cnxeHCWXF6fng_6lwkIlbVJKmRaykqJipki5WUKwnIJGagsLxVWlTQ52NwgL3mlcmTIMl4IW6RsUhalkZtkbznXBB9jQKub4OYQ3jRn-ueb-u-bnU2W1sUWX_8ghGedF7LI9PhmpPn4YnRUDIW-6vzu0oOJ-skvQt1d8s_cb0ImhVs</addsrcrecordid><sourcetype>Aggregation Database</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype></control><display><type>article</type><title>A New Efficient Approach for Variable Selection Based on Multiregression:  Prediction of Gas Chromatographic Retention Times and Response Factors</title><source>ACS Publications</source><creator>Lučić, Bono ; Trinajstić, Nenad ; Sild, Sulev ; Karelson, Mati ; Katritzky, Alan R</creator><creatorcontrib>Lučić, Bono ; Trinajstić, Nenad ; Sild, Sulev ; Karelson, Mati ; Katritzky, Alan R</creatorcontrib><description>The selection of the most relevant variable is a frequent problem in the analysis of chemical data, especially now considering the large amounts of data created by the increased computer power and analytical resolution. A novel procedure for variable selection based on multiregression (MR) analysis is developed and applied to the quantitative structure−property relationship (QSPR) modeling of gas chromatographic retention times t R and Dietz response factors RF on 152 diverse chemical compounds. 
Using 296 descriptors generated by the CODESSA program, “absolutely the best” linear MR models containing from 1 to 5 descriptors were first selected (∼2 × 10^10 models were checked), and then “the best” linear stepwise MR models with six and seven descriptors were obtained through “i by i” stepwise selection. In this paper i was varied from 1 to 4, so that in each next step i descriptors were added to the previously selected descriptors. Nonlinear models were developed by the inclusion of cross-products of initial descriptors. We selected as the most important descriptors for t_R the number of C−H and C−X bonds, connectivity indices of order 3, the highest normal mode vibrational frequency, and the rotational entropy of the molecule at 300 K. In the case of RF modeling the most important descriptors are those related to the relative number and weight of effective C atoms, the orbital electronic population, and the bond order and valency of C and H atoms. Comparison with the best six-descriptor models obtained by the normal CODESSA procedure shows that nonlinear seven-descriptor MR models now obtained achieve 30% (0.3520 vs 0.5032) and 12% (0.0472 vs 0.0530) less standard errors of estimate for t_R and RF, respectively. Our novel procedure of selecting a small number of the most important descriptors from a data set allows us to extract a larger amount of useful information than with the procedure implemented in CODESSA. Thus, our new procedure enables the selection of the best possible MR models from 10^10 possibilities. 
Through the introduction of cross-product terms, we obtained nonlinear MR models which are superior to the corresponding linear models.</description><identifier>ISSN: 0095-2338</identifier><identifier>EISSN: 1549-960X</identifier><identifier>DOI: 10.1021/ci980161a</identifier><language>eng</language><publisher>American Chemical Society</publisher><ispartof>Journal of Chemical Information and Computer Sciences, 1999-05, Vol.39 (3), p.610-621</ispartof><rights>Copyright © 1999 American Chemical Society</rights><lds50>peer_reviewed</lds50><woscitedreferencessubscribed>false</woscitedreferencessubscribed><citedby>FETCH-LOGICAL-a295t-423483b92b0c74184a2f13a5a95689ebb3c6af6ce181b96e0e05172f740d878c3</citedby><cites>FETCH-LOGICAL-a295t-423483b92b0c74184a2f13a5a95689ebb3c6af6ce181b96e0e05172f740d878c3</cites></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><linktopdf>$$Uhttps://pubs.acs.org/doi/pdf/10.1021/ci980161a$$EPDF$$P50$$Gacs$$H</linktopdf><linktohtml>$$Uhttps://pubs.acs.org/doi/10.1021/ci980161a$$EHTML$$P50$$Gacs$$H</linktohtml><link.rule.ids>314,780,784,2765,27076,27924,27925,56738,56788</link.rule.ids></links><search><creatorcontrib>Lučić, Bono</creatorcontrib><creatorcontrib>Trinajstić, Nenad</creatorcontrib><creatorcontrib>Sild, Sulev</creatorcontrib><creatorcontrib>Karelson, Mati</creatorcontrib><creatorcontrib>Katritzky, Alan R</creatorcontrib><title>A New Efficient Approach for Variable Selection Based on Multiregression:  Prediction of Gas Chromatographic Retention Times and Response Factors</title><title>Journal of Chemical Information and Computer Sciences</title><addtitle>J. Chem. Inf. Comput. Sci</addtitle><description>The selection of the most relevant variable is a frequent problem in the analysis of chemical data, especially now considering the large amounts of data created by the increased computer power and analytical resolution. 
A novel procedure for variable selection based on multiregression (MR) analysis is developed and applied to the quantitative structure−property relationship (QSPR) modeling of gas chromatographic retention times t_R and Dietz response factors RF on 152 diverse chemical compounds. Using 296 descriptors generated by the CODESSA program, “absolutely the best” linear MR models containing from 1 to 5 descriptors were first selected (∼2 × 10^10 models were checked), and then “the best” linear stepwise MR models with six and seven descriptors were obtained through “i by i” stepwise selection. In this paper i was varied from 1 to 4, so that in each next step i descriptors were added to the previously selected descriptors. Nonlinear models were developed by the inclusion of cross-products of initial descriptors. We selected as the most important descriptors for t_R the number of C−H and C−X bonds, connectivity indices of order 3, the highest normal mode vibrational frequency, and the rotational entropy of the molecule at 300 K. In the case of RF modeling the most important descriptors are those related to the relative number and weight of effective C atoms, the orbital electronic population, and the bond order and valency of C and H atoms. Comparison with the best six-descriptor models obtained by the normal CODESSA procedure shows that nonlinear seven-descriptor MR models now obtained achieve 30% (0.3520 vs 0.5032) and 12% (0.0472 vs 0.0530) less standard errors of estimate for t_R and RF, respectively. Our novel procedure of selecting a small number of the most important descriptors from a data set allows us to extract a larger amount of useful information than with the procedure implemented in CODESSA. Thus, our new procedure enables the selection of the best possible MR models from 10^10 possibilities. 
Through the introduction of cross-product terms, we obtained nonlinear MR models which are superior to the corresponding linear models.</description><issn>0095-2338</issn><issn>1549-960X</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>1999</creationdate><recordtype>article</recordtype><recordid>eNpt0DtPAzEMAOAIgUR5DPyDLAwMB3ncK2yl4iUoIFoQW-RLHRpoL6fkKmBjZeYf8ks4VMTEZMv-ZMsmZIezfc4EPzBOlYznHFZIj2epSlTOHlZJjzGVJULKcp1sxPjEmJQqFz3y2adX-EKPrXXGYd3SftMED2ZKrQ_0HoKDaoZ0hDM0rfM1PYKIE9olw8WsdQEfA8bYNQ6_3j_oTcCJWzpv6SlEOpgGP4fWPwZops7QW2y7LT9g7OYYKdSTrhYbX0ekJ2BaH-IWWbMwi7j9GzfJ3cnxeHCWXF6fng_6lwkIlbVJKmRaykqJipki5WUKwnIJGagsLxVWlTQ52NwgL3mlcmTIMl4IW6RsUhalkZtkbznXBB9jQKub4OYQ3jRn-ueb-u-bnU2W1sUWX_8ghGedF7LI9PhmpPn4YnRUDIW-6vzu0oOJ-skvQt1d8s_cb0ImhVs</recordid><startdate>19990525</startdate><enddate>19990525</enddate><creator>Lučić, Bono</creator><creator>Trinajstić, Nenad</creator><creator>Sild, Sulev</creator><creator>Karelson, Mati</creator><creator>Katritzky, Alan R</creator><general>American Chemical Society</general><scope>BSCLL</scope><scope>AAYXX</scope><scope>CITATION</scope></search><sort><creationdate>19990525</creationdate><title>A New Efficient Approach for Variable Selection Based on Multiregression:  Prediction of Gas Chromatographic Retention Times and Response Factors</title><author>Lučić, Bono ; Trinajstić, Nenad ; Sild, Sulev ; Karelson, Mati ; Katritzky, Alan R</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-a295t-423483b92b0c74184a2f13a5a95689ebb3c6af6ce181b96e0e05172f740d878c3</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>1999</creationdate><toplevel>peer_reviewed</toplevel><toplevel>online_resources</toplevel><creatorcontrib>Lučić, Bono</creatorcontrib><creatorcontrib>Trinajstić, Nenad</creatorcontrib><creatorcontrib>Sild, Sulev</creatorcontrib><creatorcontrib>Karelson, Mati</creatorcontrib><creatorcontrib>Katritzky, Alan 
R</creatorcontrib><collection>Istex</collection><collection>CrossRef</collection><jtitle>Journal of Chemical Information and Computer Sciences</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Lučić, Bono</au><au>Trinajstić, Nenad</au><au>Sild, Sulev</au><au>Karelson, Mati</au><au>Katritzky, Alan R</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>A New Efficient Approach for Variable Selection Based on Multiregression:  Prediction of Gas Chromatographic Retention Times and Response Factors</atitle><jtitle>Journal of Chemical Information and Computer Sciences</jtitle><addtitle>J. Chem. Inf. Comput. Sci</addtitle><date>1999-05-25</date><risdate>1999</risdate><volume>39</volume><issue>3</issue><spage>610</spage><epage>621</epage><pages>610-621</pages><issn>0095-2338</issn><eissn>1549-960X</eissn><abstract>The selection of the most relevant variable is a frequent problem in the analysis of chemical data, especially now considering the large amounts of data created by the increased computer power and analytical resolution. A novel procedure for variable selection based on multiregression (MR) analysis is developed and applied to the quantitative structure−property relationship (QSPR) modeling of gas chromatographic retention times t R and Dietz response factors RF on 152 diverse chemical compounds. Using 296 descriptors generated by the CODESSA program, “absolutely the best” linear MR models containing from 1 to 5 descriptors were first selected (∼2 × 1010 models were checked), and then “the best” linear stepwise MR models with six and seven descriptors were obtained through “i by i” stepwise selection. In this paper i was varied from 1 to 4, so that in each next step i descriptors were added to the previously selected descriptors. Nonlinear models were developed by the inclusion of cross-products of initial descriptors. 
We selected as the most important descriptors for t_R the number of C−H and C−X bonds, connectivity indices of order 3, the highest normal mode vibrational frequency, and the rotational entropy of the molecule at 300 K. In the case of RF modeling the most important descriptors are those related to the relative number and weight of effective C atoms, the orbital electronic population, and the bond order and valency of C and H atoms. Comparison with the best six-descriptor models obtained by the normal CODESSA procedure shows that nonlinear seven-descriptor MR models now obtained achieve 30% (0.3520 vs 0.5032) and 12% (0.0472 vs 0.0530) less standard errors of estimate for t_R and RF, respectively. Our novel procedure of selecting a small number of the most important descriptors from a data set allows us to extract a larger amount of useful information than with the procedure implemented in CODESSA. Thus, our new procedure enables the selection of the best possible MR models from 10^10 possibilities. Through the introduction of cross-product terms, we obtained nonlinear MR models which are superior to the corresponding linear models.</abstract><pub>American Chemical Society</pub><doi>10.1021/ci980161a</doi><tpages>12</tpages></addata></record>
fulltext fulltext
identifier ISSN: 0095-2338
ispartof Journal of Chemical Information and Computer Sciences, 1999-05, Vol.39 (3), p.610-621
issn 0095-2338
1549-960X
language eng
recordid cdi_crossref_primary_10_1021_ci980161a
source ACS Publications
title A New Efficient Approach for Variable Selection Based on Multiregression: Prediction of Gas Chromatographic Retention Times and Response Factors
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-08T01%3A35%3A15IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-istex_cross&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=A%20New%20Efficient%20Approach%20for%20Variable%20Selection%20Based%20on%20Multiregression:%E2%80%89%20Prediction%20of%20Gas%20Chromatographic%20Retention%20Times%20and%20Response%20Factors&rft.jtitle=Journal%20of%20Chemical%20Information%20and%20Computer%20Sciences&rft.au=Luc%CC%8Ci%C4%87,%20Bono&rft.date=1999-05-25&rft.volume=39&rft.issue=3&rft.spage=610&rft.epage=621&rft.pages=610-621&rft.issn=0095-2338&rft.eissn=1549-960X&rft_id=info:doi/10.1021/ci980161a&rft_dat=%3Cistex_cross%3Eark_67375_TPS_1TKSB7M2_N%3C/istex_cross%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_id=info:pmid/&rfr_iscdi=true