promSEMBLE: Hard Pattern Mining and Ensemble Learning for Detecting DNA Promoter Sequences

Accurate identification of DNA promoter sequences is of crucial importance in unraveling the underlying mechanisms that regulate gene transcription. Initiation of transcription is controlled through regulatory transcription factors binding to promoter core regions in the DNA sequence. Detection of p...

Bibliographic details
Published in: IEEE/ACM transactions on computational biology and bioinformatics 2024-01, Vol.21 (1), p.208-214
Main authors: Nagda, Bindi M.; Nguyen, Van Minh; White, Ryan T.
Format: Article
Language: eng
container_end_page 214
container_issue 1
container_start_page 208
container_title IEEE/ACM transactions on computational biology and bioinformatics
container_volume 21
creator Nagda, Bindi M.
Nguyen, Van Minh
White, Ryan T.
description Accurate identification of DNA promoter sequences is of crucial importance in unraveling the underlying mechanisms that regulate gene transcription. Initiation of transcription is controlled through regulatory transcription factors binding to promoter core regions in the DNA sequence. Detection of promoter regions is necessary if we are to build genetic regulatory networks for biomedical and clinical applications, and for identification of rarely expressed genes. We propose a novel ensemble learning technique using deep recurrent neural networks with convolutional feature extraction and hard negative pattern mining to detect several types of promoter sequences, including promoter sequences with the TATA-box and without the TATA-box, within DNA sequences of four different species. Using extensive independent tests and previously published results, we demonstrate that our method sets a new state-of-the-art of over 98% Matthews correlation coefficient in all eight organism categories for recognizing the stretch of base pairs that code for the promoter region within DNA sequences.
doi_str_mv 10.1109/TCBB.2023.3339597
format Article
fullrecord <record><control><sourceid>proquest_RIE</sourceid><recordid>TN_cdi_proquest_journals_2923118964</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><ieee_id>10342761</ieee_id><sourcerecordid>2923118964</sourcerecordid><originalsourceid>FETCH-LOGICAL-c302t-42ba7dca36a344af30274fc6ba2dcedc1ef4a5d4878a6601c921a0c58910ccd43</originalsourceid><addsrcrecordid>eNpdkM1LwzAchoMobn78AYJIwIuXzny38ebmdMKmA_XiJWTpr9Kxtpp0B_97UzdFPCX58bxvkgehE0oGlBJ9-TwaDgeMMD7gnGup0x3Up1KmidZK7HZ7IROpFe-hgxCWhDChidhHPZ4RSRVVffT67pvqaTwbTsdXeGJ9jue2bcHXeFbWZf2GbZ3jcR2gWqwAT8H672nReHwDLbi2O908XON57GliED_BxxpqB-EI7RV2FeB4ux6il9vx82iSTB_v7kfX08RxwtpEsIVNc2e5slwIW8RhKgqnFpblDnJHoRBW5iJLM6sUoU4zaomTmabEuVzwQ3Sx6Y1fiVeH1lRlcLBa2RqadTAs05mWnCkS0fN_6LJZ-zq-zjDNOKVZFBcpuqGcb0LwUJh3X1bWfxpKTCfedOJNJ95sxcfM2bZ5vagg_038mI7A6QYoAeBPIRcsVZR_AcUxhaw</addsrcrecordid><sourcetype>Aggregation Database</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>2923118964</pqid></control><display><type>article</type><title>promSEMBLE: Hard Pattern Mining and Ensemble Learning for Detecting DNA Promoter Sequences</title><source>IEEE Electronic Library (IEL)</source><creator>Nagda, Bindi M. ; Nguyen, Van Minh ; White, Ryan T.</creator><creatorcontrib>Nagda, Bindi M. ; Nguyen, Van Minh ; White, Ryan T.</creatorcontrib><description>Accurate identification of DNA promoter sequences is of crucial importance in unraveling the underlying mechanisms that regulate gene transcription. Initiation of transcription is controlled through regulatory transcription factors binding to promoter core regions in the DNA sequence. Detection of promoter regions is necessary if we are to build genetic regulatory networks for biomedical and clinical applications, and for identification of rarely expressed genes. 
We propose a novel ensemble learning technique using deep recurrent neural networks with convolutional feature extraction and hard negative pattern mining to detect several types of promoter sequences, including promoter sequences with the TATA-box and without the TATA-box, within DNA sequences of four different species. Using extensive independent tests and previously published results, we demonstrate that our method sets a new state-of-the-art of over 98% Matthews correlation coefficient in all eight organism categories for recognizing the stretch of base pairs that code for the promoter region within DNA sequences.</description><identifier>ISSN: 1545-5963</identifier><identifier>EISSN: 1557-9964</identifier><identifier>DOI: 10.1109/TCBB.2023.3339597</identifier><identifier>PMID: 38051616</identifier><identifier>CODEN: ITCBCY</identifier><language>eng</language><publisher>United States: IEEE</publisher><subject>Base Sequence ; Bioinformatics ; Biological system modeling ; Correlation coefficient ; Correlation coefficients ; Data mining ; Deoxyribonucleic acid ; DNA ; DNA - genetics ; DNA sequences ; Ensemble learning ; Feature extraction ; Gene sequencing ; Genomics ; Machine Learning ; Neural networks ; Nucleotide sequence ; Pattern analysis ; Promoter regions ; Promoter Regions, Genetic - genetics ; Recurrent neural networks ; Regulatory sequences ; TATA Box ; Transcription factors ; Transcription initiation ; Transcription, Genetic</subject><ispartof>IEEE/ACM transactions on computational biology and bioinformatics, 2024-01, Vol.21 (1), p.208-214</ispartof><rights>Copyright The Institute of Electrical and Electronics Engineers, Inc. 
(IEEE) 2024</rights><lds50>peer_reviewed</lds50><woscitedreferencessubscribed>false</woscitedreferencessubscribed><cites>FETCH-LOGICAL-c302t-42ba7dca36a344af30274fc6ba2dcedc1ef4a5d4878a6601c921a0c58910ccd43</cites><orcidid>0000-0003-3507-5494 ; 0000-0002-2479-2503 ; 0000-0002-5524-629X</orcidid></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><linktohtml>$$Uhttps://ieeexplore.ieee.org/document/10342761$$EHTML$$P50$$Gieee$$H</linktohtml><link.rule.ids>314,776,780,792,27901,27902,54733</link.rule.ids><linktorsrc>$$Uhttps://ieeexplore.ieee.org/document/10342761$$EView_record_in_IEEE$$FView_record_in_$$GIEEE</linktorsrc><backlink>$$Uhttps://www.ncbi.nlm.nih.gov/pubmed/38051616$$D View this record in MEDLINE/PubMed$$Hfree_for_read</backlink></links><search><creatorcontrib>Nagda, Bindi M.</creatorcontrib><creatorcontrib>Nguyen, Van Minh</creatorcontrib><creatorcontrib>White, Ryan T.</creatorcontrib><title>promSEMBLE: Hard Pattern Mining and Ensemble Learning for Detecting DNA Promoter Sequences</title><title>IEEE/ACM transactions on computational biology and bioinformatics</title><addtitle>TCBB</addtitle><addtitle>IEEE/ACM Trans Comput Biol Bioinform</addtitle><description>Accurate identification of DNA promoter sequences is of crucial importance in unraveling the underlying mechanisms that regulate gene transcription. Initiation of transcription is controlled through regulatory transcription factors binding to promoter core regions in the DNA sequence. Detection of promoter regions is necessary if we are to build genetic regulatory networks for biomedical and clinical applications, and for identification of rarely expressed genes. 
We propose a novel ensemble learning technique using deep recurrent neural networks with convolutional feature extraction and hard negative pattern mining to detect several types of promoter sequences, including promoter sequences with the TATA-box and without the TATA-box, within DNA sequences of four different species. Using extensive independent tests and previously published results, we demonstrate that our method sets a new state-of-the-art of over 98% Matthews correlation coefficient in all eight organism categories for recognizing the stretch of base pairs that code for the promoter region within DNA sequences.</description><subject>Base Sequence</subject><subject>Bioinformatics</subject><subject>Biological system modeling</subject><subject>Correlation coefficient</subject><subject>Correlation coefficients</subject><subject>Data mining</subject><subject>Deoxyribonucleic acid</subject><subject>DNA</subject><subject>DNA - genetics</subject><subject>DNA sequences</subject><subject>Ensemble learning</subject><subject>Feature extraction</subject><subject>Gene sequencing</subject><subject>Genomics</subject><subject>Machine Learning</subject><subject>Neural networks</subject><subject>Nucleotide sequence</subject><subject>Pattern analysis</subject><subject>Promoter regions</subject><subject>Promoter Regions, Genetic - genetics</subject><subject>Recurrent neural networks</subject><subject>Regulatory sequences</subject><subject>TATA Box</subject><subject>Transcription factors</subject><subject>Transcription initiation</subject><subject>Transcription, 
Genetic</subject><issn>1545-5963</issn><issn>1557-9964</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2024</creationdate><recordtype>article</recordtype><sourceid>RIE</sourceid><sourceid>EIF</sourceid><recordid>eNpdkM1LwzAchoMobn78AYJIwIuXzny38ebmdMKmA_XiJWTpr9Kxtpp0B_97UzdFPCX58bxvkgehE0oGlBJ9-TwaDgeMMD7gnGup0x3Up1KmidZK7HZ7IROpFe-hgxCWhDChidhHPZ4RSRVVffT67pvqaTwbTsdXeGJ9jue2bcHXeFbWZf2GbZ3jcR2gWqwAT8H672nReHwDLbi2O908XON57GliED_BxxpqB-EI7RV2FeB4ux6il9vx82iSTB_v7kfX08RxwtpEsIVNc2e5slwIW8RhKgqnFpblDnJHoRBW5iJLM6sUoU4zaomTmabEuVzwQ3Sx6Y1fiVeH1lRlcLBa2RqadTAs05mWnCkS0fN_6LJZ-zq-zjDNOKVZFBcpuqGcb0LwUJh3X1bWfxpKTCfedOJNJ95sxcfM2bZ5vagg_038mI7A6QYoAeBPIRcsVZR_AcUxhaw</recordid><startdate>202401</startdate><enddate>202401</enddate><creator>Nagda, Bindi M.</creator><creator>Nguyen, Van Minh</creator><creator>White, Ryan T.</creator><general>IEEE</general><general>The Institute of Electrical and Electronics Engineers, Inc. (IEEE)</general><scope>97E</scope><scope>RIA</scope><scope>RIE</scope><scope>CGR</scope><scope>CUY</scope><scope>CVF</scope><scope>ECM</scope><scope>EIF</scope><scope>NPM</scope><scope>AAYXX</scope><scope>CITATION</scope><scope>7QF</scope><scope>7QO</scope><scope>7QQ</scope><scope>7SC</scope><scope>7SE</scope><scope>7SP</scope><scope>7SR</scope><scope>7TA</scope><scope>7TB</scope><scope>7U5</scope><scope>8BQ</scope><scope>8FD</scope><scope>F28</scope><scope>FR3</scope><scope>H8D</scope><scope>JG9</scope><scope>JQ2</scope><scope>KR7</scope><scope>L7M</scope><scope>L~C</scope><scope>L~D</scope><scope>P64</scope><scope>7X8</scope><orcidid>https://orcid.org/0000-0003-3507-5494</orcidid><orcidid>https://orcid.org/0000-0002-2479-2503</orcidid><orcidid>https://orcid.org/0000-0002-5524-629X</orcidid></search><sort><creationdate>202401</creationdate><title>promSEMBLE: Hard Pattern Mining and Ensemble Learning for Detecting DNA Promoter Sequences</title><author>Nagda, Bindi M. 
; Nguyen, Van Minh ; White, Ryan T.</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-c302t-42ba7dca36a344af30274fc6ba2dcedc1ef4a5d4878a6601c921a0c58910ccd43</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2024</creationdate><topic>Base Sequence</topic><topic>Bioinformatics</topic><topic>Biological system modeling</topic><topic>Correlation coefficient</topic><topic>Correlation coefficients</topic><topic>Data mining</topic><topic>Deoxyribonucleic acid</topic><topic>DNA</topic><topic>DNA - genetics</topic><topic>DNA sequences</topic><topic>Ensemble learning</topic><topic>Feature extraction</topic><topic>Gene sequencing</topic><topic>Genomics</topic><topic>Machine Learning</topic><topic>Neural networks</topic><topic>Nucleotide sequence</topic><topic>Pattern analysis</topic><topic>Promoter regions</topic><topic>Promoter Regions, Genetic - genetics</topic><topic>Recurrent neural networks</topic><topic>Regulatory sequences</topic><topic>TATA Box</topic><topic>Transcription factors</topic><topic>Transcription initiation</topic><topic>Transcription, Genetic</topic><toplevel>peer_reviewed</toplevel><toplevel>online_resources</toplevel><creatorcontrib>Nagda, Bindi M.</creatorcontrib><creatorcontrib>Nguyen, Van Minh</creatorcontrib><creatorcontrib>White, Ryan T.</creatorcontrib><collection>IEEE All-Society Periodicals Package (ASPP) 2005-present</collection><collection>IEEE All-Society Periodicals Package (ASPP) 1998-Present</collection><collection>IEEE Electronic Library (IEL)</collection><collection>Medline</collection><collection>MEDLINE</collection><collection>MEDLINE (Ovid)</collection><collection>MEDLINE</collection><collection>MEDLINE</collection><collection>PubMed</collection><collection>CrossRef</collection><collection>Aluminium Industry Abstracts</collection><collection>Biotechnology Research Abstracts</collection><collection>Ceramic Abstracts</collection><collection>Computer 
and Information Systems Abstracts</collection><collection>Corrosion Abstracts</collection><collection>Electronics &amp; Communications Abstracts</collection><collection>Engineered Materials Abstracts</collection><collection>Materials Business File</collection><collection>Mechanical &amp; Transportation Engineering Abstracts</collection><collection>Solid State and Superconductivity Abstracts</collection><collection>METADEX</collection><collection>Technology Research Database</collection><collection>ANTE: Abstracts in New Technology &amp; Engineering</collection><collection>Engineering Research Database</collection><collection>Aerospace Database</collection><collection>Materials Research Database</collection><collection>ProQuest Computer Science Collection</collection><collection>Civil Engineering Abstracts</collection><collection>Advanced Technologies Database with Aerospace</collection><collection>Computer and Information Systems Abstracts – Academic</collection><collection>Computer and Information Systems Abstracts Professional</collection><collection>Biotechnology and BioEngineering Abstracts</collection><collection>MEDLINE - Academic</collection><jtitle>IEEE/ACM transactions on computational biology and bioinformatics</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext_linktorsrc</fulltext></delivery><addata><au>Nagda, Bindi M.</au><au>Nguyen, Van Minh</au><au>White, Ryan T.</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>promSEMBLE: Hard Pattern Mining and Ensemble Learning for Detecting DNA Promoter Sequences</atitle><jtitle>IEEE/ACM transactions on computational biology and bioinformatics</jtitle><stitle>TCBB</stitle><addtitle>IEEE/ACM Trans Comput Biol 
Bioinform</addtitle><date>2024-01</date><risdate>2024</risdate><volume>21</volume><issue>1</issue><spage>208</spage><epage>214</epage><pages>208-214</pages><issn>1545-5963</issn><eissn>1557-9964</eissn><coden>ITCBCY</coden><abstract>Accurate identification of DNA promoter sequences is of crucial importance in unraveling the underlying mechanisms that regulate gene transcription. Initiation of transcription is controlled through regulatory transcription factors binding to promoter core regions in the DNA sequence. Detection of promoter regions is necessary if we are to build genetic regulatory networks for biomedical and clinical applications, and for identification of rarely expressed genes. We propose a novel ensemble learning technique using deep recurrent neural networks with convolutional feature extraction and hard negative pattern mining to detect several types of promoter sequences, including promoter sequences with the TATA-box and without the TATA-box, within DNA sequences of four different species. Using extensive independent tests and previously published results, we demonstrate that our method sets a new state-of-the-art of over 98% Matthews correlation coefficient in all eight organism categories for recognizing the stretch of base pairs that code for the promoter region within DNA sequences.</abstract><cop>United States</cop><pub>IEEE</pub><pmid>38051616</pmid><doi>10.1109/TCBB.2023.3339597</doi><tpages>7</tpages><orcidid>https://orcid.org/0000-0003-3507-5494</orcidid><orcidid>https://orcid.org/0000-0002-2479-2503</orcidid><orcidid>https://orcid.org/0000-0002-5524-629X</orcidid></addata></record>
fulltext fulltext_linktorsrc
identifier ISSN: 1545-5963
ispartof IEEE/ACM transactions on computational biology and bioinformatics, 2024-01, Vol.21 (1), p.208-214
issn 1545-5963
1557-9964
language eng
recordid cdi_proquest_journals_2923118964
source IEEE Electronic Library (IEL)
subjects Base Sequence
Bioinformatics
Biological system modeling
Correlation coefficient
Correlation coefficients
Data mining
Deoxyribonucleic acid
DNA
DNA - genetics
DNA sequences
Ensemble learning
Feature extraction
Gene sequencing
Genomics
Machine Learning
Neural networks
Nucleotide sequence
Pattern analysis
Promoter regions
Promoter Regions, Genetic - genetics
Recurrent neural networks
Regulatory sequences
TATA Box
Transcription factors
Transcription initiation
Transcription, Genetic
title promSEMBLE: Hard Pattern Mining and Ensemble Learning for Detecting DNA Promoter Sequences
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-02-20T14%3A43%3A00IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_RIE&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=promSEMBLE:%20Hard%20Pattern%20Mining%20and%20Ensemble%20Learning%20for%20Detecting%20DNA%20Promoter%20Sequences&rft.jtitle=IEEE/ACM%20transactions%20on%20computational%20biology%20and%20bioinformatics&rft.au=Nagda,%20Bindi%20M.&rft.date=2024-01&rft.volume=21&rft.issue=1&rft.spage=208&rft.epage=214&rft.pages=208-214&rft.issn=1545-5963&rft.eissn=1557-9964&rft.coden=ITCBCY&rft_id=info:doi/10.1109/TCBB.2023.3339597&rft_dat=%3Cproquest_RIE%3E2923118964%3C/proquest_RIE%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2923118964&rft_id=info:pmid/38051616&rft_ieee_id=10342761&rfr_iscdi=true