The power of detecting enriched patterns: an HMM approach

The identification of binding sites of transcription factors (TF) and other regulatory regions, referred to as motifs, located in a set of molecular sequences is of fundamental importance in genomic research. Many computational and experimental approaches have been developed to locate motifs. The se...

Detailed description

Saved in:
Bibliographic details
Published in: Journal of computational biology 2010-04, Vol.17 (4), p.581-592
Main authors: Zhai, Zhiyuan; Ku, Shih-Yen; Luan, Yihui; Reinert, Gesine; Waterman, Michael S; Sun, Fengzhu
Format: Article
Language: eng
Subjects:
Online access: Full text
container_end_page 592
container_issue 4
container_start_page 581
container_title Journal of computational biology
container_volume 17
creator Zhai, Zhiyuan
Ku, Shih-Yen
Luan, Yihui
Reinert, Gesine
Waterman, Michael S
Sun, Fengzhu
description The identification of binding sites of transcription factors (TF) and other regulatory regions, referred to as motifs, located in a set of molecular sequences is of fundamental importance in genomic research. Many computational and experimental approaches have been developed to locate motifs. The set of sequences of interest can be concatenated to form a long sequence of length n. One of the successful approaches for motif discovery is to identify statistically over- or under-represented patterns in this long sequence. A pattern refers to a fixed word W over the alphabet. In the example of interest, W is a word in the set of patterns of the motif. Despite extensive studies on motif discovery, no studies have been carried out on the power of detecting statistically over- or under-represented patterns. Here we address the issue of how the known presence of random instances of a known motif affects the power of detecting patterns, such as patterns within the motif. Let N_W(n) be the number of possibly overlapping occurrences of a pattern W in the sequence that contains instances of a known motif; such a sequence is modeled here by a Hidden Markov Model (HMM). First, efficient computational methods for calculating the mean and variance of N_W(n) are developed. Second, efficient computational methods are developed for calculating the parameters involved in the normal approximation of N_W(n) for frequent patterns and in the compound Poisson approximation of N_W(n) for rare patterns. Third, an easy-to-use web program is developed to calculate the power of detecting patterns, and the program is used to study the power of detection in several interesting biological examples.
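The quantity at the centre of the abstract, the number of possibly overlapping occurrences of a word W in a sequence of length n, can be illustrated with a minimal Python sketch. The function names `count_overlapping` and `expected_count_iid` and the toy sequence are hypothetical and are not taken from the paper or its web program. The expected count shown assumes a simple i.i.d. background, which is a deliberate simplification and not the Hidden Markov Model used by the authors.

```python
def count_overlapping(sequence: str, word: str) -> int:
    """Count occurrences of `word` in `sequence`, allowing overlaps."""
    if not word:
        raise ValueError("word must be non-empty")
    # Test every start position; str.count would skip overlapping hits.
    return sum(
        1
        for start in range(len(sequence) - len(word) + 1)
        if sequence.startswith(word, start)
    )


def expected_count_iid(n: int, word: str, letter_probs: dict) -> float:
    """Expected number of occurrences under an i.i.d. letter model.

    Assumption for illustration only: letters are independent, so each of
    the n - len(word) + 1 start positions matches with the same probability.
    This is NOT the HMM described in the abstract.
    """
    word_prob = 1.0
    for letter in word:
        word_prob *= letter_probs[letter]
    return max(n - len(word) + 1, 0) * word_prob


sequence = "ACACACGT"  # toy sequence, chosen so that the hits overlap
uniform = {base: 0.25 for base in "ACGT"}

print(count_overlapping(sequence, "ACA"))                  # 2
print(expected_count_iid(len(sequence), "ACA", uniform))   # 0.09375
```

Comparing an observed count with its expectation under a background model is the basic idea behind calling a pattern over- or under-represented; the variance and the normal or compound Poisson approximations mentioned in the abstract are what turn that comparison into a test with a quantifiable power.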
doi_str_mv 10.1089/cmb.2009.0218
format Article
fullrecord <record><control><sourceid>gale_pubme</sourceid><recordid>TN_cdi_pubmedcentral_primary_oai_pubmedcentral_nih_gov_3203519</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><galeid>A226162798</galeid><sourcetype>Open Access Repository</sourcetype><recordtype>article</recordtype><pqid>1323812241</pqid></control><display><type>article</type><title>The power of detecting enriched patterns: an HMM approach</title><identifier>ISSN: 1066-5277</identifier><identifier>EISSN: 1557-8666</identifier><identifier>DOI: 10.1089/cmb.2009.0218</identifier><identifier>PMID: 20426691</identifier><language>eng</language><publisher>United States: Mary Ann Liebert, Inc</publisher><rights>Copyright 2010, Mary Ann Liebert, Inc.</rights><lds50>peer_reviewed</lds50><oa>free_for_read</oa></display><links><backlink>$$Uhttps://www.ncbi.nlm.nih.gov/pubmed/20426691$$D View this record in MEDLINE/PubMed$$Hfree_for_read</backlink></links><addata><addtitle>J Comput Biol</addtitle><date>2010-04</date><tpages>12</tpages></addata></record>
fulltext fulltext
identifier ISSN: 1066-5277
ispartof Journal of computational biology, 2010-04, Vol.17 (4), p.581-592
issn 1066-5277
1557-8666
language eng
recordid cdi_pubmedcentral_primary_oai_pubmedcentral_nih_gov_3203519
source Mary Ann Liebert Online Subscription; MEDLINE; Alma/SFX Local Collection
subjects Approximation
Base Composition - genetics
Base Sequence
Binding sites
Biological
Biology
Computation
Computational biology
CpG Islands - genetics
Genetic regulation
Hidden Markov models
Internet
Markov Chains
Mathematical analysis
Mathematical models
Numerical Analysis, Computer-Assisted
Pattern Recognition, Automated - methods
Physiological aspects
Poisson Distribution
Sequence Analysis, DNA - methods
Transcription factors
Variance
title The power of detecting enriched patterns: an HMM approach
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2024-12-12T05%3A40%3A15IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-gale_pubme&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=The%20power%20of%20detecting%20enriched%20patterns:%20an%20HMM%20approach&rft.jtitle=Journal%20of%20computational%20biology&rft.au=Zhai,%20Zhiyuan&rft.date=2010-04&rft.volume=17&rft.issue=4&rft.spage=581&rft.epage=592&rft.pages=581-592&rft.issn=1066-5277&rft.eissn=1557-8666&rft_id=info:doi/10.1089/cmb.2009.0218&rft_dat=%3Cgale_pubme%3EA226162798%3C/gale_pubme%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=1323812241&rft_id=info:pmid/20426691&rft_galeid=A226162798&rfr_iscdi=true