Data Mining Approach for Extraction of Useful Information About Biologically Active Compounds from Publications

A lot of high quality data on the biological activity of chemical compounds are required throughout the whole drug discovery process: from development of computational models of the structure–activity relationship to experimental testing of lead compounds and their validation in clinics. Currently,...

Bibliographic details
Published in:	Journal of chemical information and modeling 2019-09, Vol.59 (9), p.3635-3644
Main authors:	Tarasova, Olga A; Biziukova, Nadezhda Yu; Filimonov, Dmitry A; Poroikov, Vladimir V; Nicklaus, Marc C
Format:	Article
Language:	eng
Subjects:	Bioassays; Biological activity; Chemical activity; Chemical compounds; Data mining; Documents; Fragments; Lead compounds; Machine learning; Organic chemistry; Scientific papers; Toxicity
Online access:	Full text

container_end_page	3644
container_issue	9
container_start_page	3635
container_title	Journal of chemical information and modeling
container_volume	59
creator	Tarasova, Olga A; Biziukova, Nadezhda Yu; Filimonov, Dmitry A; Poroikov, Vladimir V; Nicklaus, Marc C
description	A lot of high quality data on the biological activity of chemical compounds are required throughout the whole drug discovery process: from development of computational models of the structure–activity relationship to experimental testing of lead compounds and their validation in clinics. Currently, a large amount of such data is available from databases, scientific publications, and patents. Biological data are characterized by incompleteness, uncertainty, and low reproducibility. Despite the existence of free and commercially available databases of biological activities of compounds, they usually lack unambiguous information about peculiarities of biological assays. On the other hand, scientific papers are the primary source of new data disclosed to the scientific community for the first time. In this study, we have developed and validated a data-mining approach for extraction of text fragments containing description of bioassays. We have used this approach to evaluate compounds and their biological activity reported in scientific publications. We have found that categorization of papers into relevant and irrelevant may be performed based on the machine-learning analysis of the abstracts. Text fragments extracted from the full texts of publications allow their further partitioning into several classes according to the peculiarities of bioassays. We demonstrate the applicability of our approach to the comparison of the endpoint values of biological activity and cytotoxicity of reference compounds.
doi_str_mv	10.1021/acs.jcim.9b00164
format	Article
fullrecord	<record><control><sourceid>proquest_pubme</sourceid><recordid>TN_cdi_pubmedcentral_primary_oai_pubmedcentral_nih_gov_8194363</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><sourcerecordid>2281114753</sourcerecordid><originalsourceid>FETCH-LOGICAL-a498t-b37425265e512ef120615da8bf42b9f754174df9ca9cc960334c608cc5be6f653</originalsourceid><addsrcrecordid>eNp1kU1vEzEYhK2qiJbCnROy1EsPJPg76wtSGgpUalUOVOJmeR07deS1t_ZuRf89TpNUBYmTLb_PzOvRAPAeoylGBH_SpkzXxndT2SKEBTsAx5gzOZEC_Trc37kUR-BNKWuEKJWCvAZHFDNOhWTHIH3Rg4bXPvq4gvO-z0mbO-hShhe_h6zN4FOEycHbYt0Y4GWso04_vc7bNA7w3KeQVt7oEB7hvPIPFi5S16cxLgt0OXXwx9iGCmxE5S145XQo9t3uPAG3Xy9-Lr5Prm6-XS7mVxPNZDNMWjpjhBPBLcfEOkyQwHypm9Yx0ko34wzP2NJJo6UxNS2lzAjUGMNbK5zg9AR83vr2Y9vZpbGxpgmqz77T-VEl7dXfk-jv1Co9qAZLRgWtBmc7g5zuR1sG1flibAg62jQWRUiDMWYzvkFP_0HXacyxxquUFIJKRFil0JYyOZWSrXv-DEZq06aqbapNm2rXZpV8eBniWbCvrwIft8CTdL_0v35_AG22rVE</addsrcrecordid><sourcetype>Open Access Repository</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>2296639024</pqid></control><display><type>article</type><title>Data Mining Approach for Extraction of Useful Information About Biologically Active Compounds from Publications</title><source>American Chemical Society Journals</source><creator>Tarasova, Olga A ; Biziukova, Nadezhda Yu ; Filimonov, Dmitry A ; Poroikov, Vladimir V ; Nicklaus, Marc C</creator><creatorcontrib>Tarasova, Olga A ; Biziukova, Nadezhda Yu ; Filimonov, Dmitry A ; Poroikov, Vladimir V ; Nicklaus, Marc C</creatorcontrib><description>A lot of high quality data on the biological activity of chemical compounds are required throughout the whole drug discovery process: from development of computational models of the structure–activity relationship to experimental testing of lead compounds and their validation in clinics. Currently, a large amount of such data is available from databases, scientific publications, and patents. 
Biological data are characterized by incompleteness, uncertainty, and low reproducibility. Despite the existence of free and commercially available databases of biological activities of compounds, they usually lack unambiguous information about peculiarities of biological assays. On the other hand, scientific papers are the primary source of new data disclosed to the scientific community for the first time. In this study, we have developed and validated a data-mining approach for extraction of text fragments containing description of bioassays. We have used this approach to evaluate compounds and their biological activity reported in scientific publications. We have found that categorization of papers into relevant and irrelevant may be performed based on the machine-learning analysis of the abstracts. Text fragments extracted from the full texts of publications allow their further partitioning into several classes according to the peculiarities of bioassays. We demonstrate the applicability of our approach to the comparison of the endpoint values of biological activity and cytotoxicity of reference compounds.</description><identifier>ISSN: 1549-9596</identifier><identifier>EISSN: 1549-960X</identifier><identifier>DOI: 10.1021/acs.jcim.9b00164</identifier><identifier>PMID: 31453694</identifier><language>eng</language><publisher>United States: American Chemical Society</publisher><subject>Bioassays ; Biological activity ; Chemical activity ; Chemical compounds ; Data mining ; Documents ; Fragments ; Lead compounds ; Machine learning ; Organic chemistry ; Scientific papers ; Toxicity</subject><ispartof>Journal of chemical information and modeling, 2019-09, Vol.59 (9), p.3635-3644</ispartof><rights>Copyright American Chemical Society Sep 23, 
2019</rights><lds50>peer_reviewed</lds50><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed><citedby>FETCH-LOGICAL-a498t-b37425265e512ef120615da8bf42b9f754174df9ca9cc960334c608cc5be6f653</citedby><cites>FETCH-LOGICAL-a498t-b37425265e512ef120615da8bf42b9f754174df9ca9cc960334c608cc5be6f653</cites><orcidid>0000-0002-0339-8478 ; 0000-0001-7937-2621 ; 0000-0002-3723-7832</orcidid></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><linktopdf>$$Uhttps://pubs.acs.org/doi/pdf/10.1021/acs.jcim.9b00164$$EPDF$$P50$$Gacs$$H</linktopdf><linktohtml>$$Uhttps://pubs.acs.org/doi/10.1021/acs.jcim.9b00164$$EHTML$$P50$$Gacs$$H</linktohtml><link.rule.ids>230,315,781,785,886,2766,27081,27929,27930,56743,56793</link.rule.ids><backlink>$$Uhttps://www.ncbi.nlm.nih.gov/pubmed/31453694$$D View this record in MEDLINE/PubMed$$Hfree_for_read</backlink></links><search><creatorcontrib>Tarasova, Olga A</creatorcontrib><creatorcontrib>Biziukova, Nadezhda Yu</creatorcontrib><creatorcontrib>Filimonov, Dmitry A</creatorcontrib><creatorcontrib>Poroikov, Vladimir V</creatorcontrib><creatorcontrib>Nicklaus, Marc C</creatorcontrib><title>Data Mining Approach for Extraction of Useful Information About Biologically Active Compounds from Publications</title><title>Journal of chemical information and modeling</title><addtitle>J. Chem. Inf. Model</addtitle><description>A lot of high quality data on the biological activity of chemical compounds are required throughout the whole drug discovery process: from development of computational models of the structure–activity relationship to experimental testing of lead compounds and their validation in clinics. Currently, a large amount of such data is available from databases, scientific publications, and patents. Biological data are characterized by incompleteness, uncertainty, and low reproducibility. 
Despite the existence of free and commercially available databases of biological activities of compounds, they usually lack unambiguous information about peculiarities of biological assays. On the other hand, scientific papers are the primary source of new data disclosed to the scientific community for the first time. In this study, we have developed and validated a data-mining approach for extraction of text fragments containing description of bioassays. We have used this approach to evaluate compounds and their biological activity reported in scientific publications. We have found that categorization of papers into relevant and irrelevant may be performed based on the machine-learning analysis of the abstracts. Text fragments extracted from the full texts of publications allow their further partitioning into several classes according to the peculiarities of bioassays. We demonstrate the applicability of our approach to the comparison of the endpoint values of biological activity and cytotoxicity of reference compounds.</description><subject>Bioassays</subject><subject>Biological activity</subject><subject>Chemical activity</subject><subject>Chemical compounds</subject><subject>Data mining</subject><subject>Documents</subject><subject>Fragments</subject><subject>Lead compounds</subject><subject>Machine learning</subject><subject>Organic chemistry</subject><subject>Scientific 
papers</subject><subject>Toxicity</subject><issn>1549-9596</issn><issn>1549-960X</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2019</creationdate><recordtype>article</recordtype><recordid>eNp1kU1vEzEYhK2qiJbCnROy1EsPJPg76wtSGgpUalUOVOJmeR07deS1t_ZuRf89TpNUBYmTLb_PzOvRAPAeoylGBH_SpkzXxndT2SKEBTsAx5gzOZEC_Trc37kUR-BNKWuEKJWCvAZHFDNOhWTHIH3Rg4bXPvq4gvO-z0mbO-hShhe_h6zN4FOEycHbYt0Y4GWso04_vc7bNA7w3KeQVt7oEB7hvPIPFi5S16cxLgt0OXXwx9iGCmxE5S145XQo9t3uPAG3Xy9-Lr5Prm6-XS7mVxPNZDNMWjpjhBPBLcfEOkyQwHypm9Yx0ko34wzP2NJJo6UxNS2lzAjUGMNbK5zg9AR83vr2Y9vZpbGxpgmqz77T-VEl7dXfk-jv1Co9qAZLRgWtBmc7g5zuR1sG1flibAg62jQWRUiDMWYzvkFP_0HXacyxxquUFIJKRFil0JYyOZWSrXv-DEZq06aqbapNm2rXZpV8eBniWbCvrwIft8CTdL_0v35_AG22rVE</recordid><startdate>20190923</startdate><enddate>20190923</enddate><creator>Tarasova, Olga A</creator><creator>Biziukova, Nadezhda Yu</creator><creator>Filimonov, Dmitry A</creator><creator>Poroikov, Vladimir V</creator><creator>Nicklaus, Marc C</creator><general>American Chemical Society</general><scope>NPM</scope><scope>AAYXX</scope><scope>CITATION</scope><scope>7SC</scope><scope>7SR</scope><scope>7U5</scope><scope>8BQ</scope><scope>8FD</scope><scope>JG9</scope><scope>JQ2</scope><scope>L7M</scope><scope>L~C</scope><scope>L~D</scope><scope>7X8</scope><scope>5PM</scope><orcidid>https://orcid.org/0000-0002-0339-8478</orcidid><orcidid>https://orcid.org/0000-0001-7937-2621</orcidid><orcidid>https://orcid.org/0000-0002-3723-7832</orcidid></search><sort><creationdate>20190923</creationdate><title>Data Mining Approach for Extraction of Useful Information About Biologically Active Compounds from Publications</title><author>Tarasova, Olga A ; Biziukova, Nadezhda Yu ; Filimonov, Dmitry A ; Poroikov, Vladimir V ; Nicklaus, Marc 
C</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-a498t-b37425265e512ef120615da8bf42b9f754174df9ca9cc960334c608cc5be6f653</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2019</creationdate><topic>Bioassays</topic><topic>Biological activity</topic><topic>Chemical activity</topic><topic>Chemical compounds</topic><topic>Data mining</topic><topic>Documents</topic><topic>Fragments</topic><topic>Lead compounds</topic><topic>Machine learning</topic><topic>Organic chemistry</topic><topic>Scientific papers</topic><topic>Toxicity</topic><toplevel>peer_reviewed</toplevel><toplevel>online_resources</toplevel><creatorcontrib>Tarasova, Olga A</creatorcontrib><creatorcontrib>Biziukova, Nadezhda Yu</creatorcontrib><creatorcontrib>Filimonov, Dmitry A</creatorcontrib><creatorcontrib>Poroikov, Vladimir V</creatorcontrib><creatorcontrib>Nicklaus, Marc C</creatorcontrib><collection>PubMed</collection><collection>CrossRef</collection><collection>Computer and Information Systems Abstracts</collection><collection>Engineered Materials Abstracts</collection><collection>Solid State and Superconductivity Abstracts</collection><collection>METADEX</collection><collection>Technology Research Database</collection><collection>Materials Research Database</collection><collection>ProQuest Computer Science Collection</collection><collection>Advanced Technologies Database with Aerospace</collection><collection>Computer and Information Systems Abstracts Academic</collection><collection>Computer and Information Systems Abstracts Professional</collection><collection>MEDLINE - Academic</collection><collection>PubMed Central (Full Participant titles)</collection><jtitle>Journal of chemical information and modeling</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Tarasova, Olga A</au><au>Biziukova, Nadezhda Yu</au><au>Filimonov, Dmitry 
A</au><au>Poroikov, Vladimir V</au><au>Nicklaus, Marc C</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>Data Mining Approach for Extraction of Useful Information About Biologically Active Compounds from Publications</atitle><jtitle>Journal of chemical information and modeling</jtitle><addtitle>J. Chem. Inf. Model</addtitle><date>2019-09-23</date><risdate>2019</risdate><volume>59</volume><issue>9</issue><spage>3635</spage><epage>3644</epage><pages>3635-3644</pages><issn>1549-9596</issn><eissn>1549-960X</eissn><abstract>A lot of high quality data on the biological activity of chemical compounds are required throughout the whole drug discovery process: from development of computational models of the structure–activity relationship to experimental testing of lead compounds and their validation in clinics. Currently, a large amount of such data is available from databases, scientific publications, and patents. Biological data are characterized by incompleteness, uncertainty, and low reproducibility. Despite the existence of free and commercially available databases of biological activities of compounds, they usually lack unambiguous information about peculiarities of biological assays. On the other hand, scientific papers are the primary source of new data disclosed to the scientific community for the first time. In this study, we have developed and validated a data-mining approach for extraction of text fragments containing description of bioassays. We have used this approach to evaluate compounds and their biological activity reported in scientific publications. We have found that categorization of papers into relevant and irrelevant may be performed based on the machine-learning analysis of the abstracts. Text fragments extracted from the full texts of publications allow their further partitioning into several classes according to the peculiarities of bioassays. 
We demonstrate the applicability of our approach to the comparison of the endpoint values of biological activity and cytotoxicity of reference compounds.</abstract><cop>United States</cop><pub>American Chemical Society</pub><pmid>31453694</pmid><doi>10.1021/acs.jcim.9b00164</doi><tpages>10</tpages><orcidid>https://orcid.org/0000-0002-0339-8478</orcidid><orcidid>https://orcid.org/0000-0001-7937-2621</orcidid><orcidid>https://orcid.org/0000-0002-3723-7832</orcidid><oa>free_for_read</oa></addata></record>
fulltext	fulltext
identifier	ISSN: 1549-9596
ispartof	Journal of chemical information and modeling, 2019-09, Vol.59 (9), p.3635-3644
issn	1549-9596 1549-960X
language	eng
recordid	cdi_pubmedcentral_primary_oai_pubmedcentral_nih_gov_8194363
source	American Chemical Society Journals
subjects	Bioassays; Biological activity; Chemical activity; Chemical compounds; Data mining; Documents; Fragments; Lead compounds; Machine learning; Organic chemistry; Scientific papers; Toxicity
title	Data Mining Approach for Extraction of Useful Information About Biologically Active Compounds from Publications
url	https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2024-12-10T20%3A20%3A00IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_pubme&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Data%20Mining%20Approach%20for%20Extraction%20of%20Useful%20Information%20About%20Biologically%20Active%20Compounds%20from%20Publications&rft.jtitle=Journal%20of%20chemical%20information%20and%20modeling&rft.au=Tarasova,%20Olga%20A&rft.date=2019-09-23&rft.volume=59&rft.issue=9&rft.spage=3635&rft.epage=3644&rft.pages=3635-3644&rft.issn=1549-9596&rft.eissn=1549-960X&rft_id=info:doi/10.1021/acs.jcim.9b00164&rft_dat=%3Cproquest_pubme%3E2281114753%3C/proquest_pubme%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2296639024&rft_id=info:pmid/31453694&rfr_iscdi=true