The power of detecting enriched patterns: an HMM approach

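The identification of binding sites of transcription factors (TF) and other regulatory regions, referred to as motifs, located in a set of molecular sequences is of fundamental importance in genomic research. Many computational and experimental approaches have been developed to locate motifs. The set of sequences of interest can be concatenated to form a long sequence of length n. One of the successful approaches for motif discovery is to identify statistically over- or under-represented patterns in this long sequence. A pattern refers to a fixed word W over the alphabet. In the example of interest, W is a word in the set of patterns of the motif. Despite extensive studies on motif discovery, no studies have been carried out on the power of detecting statistically over- or under-represented patterns. Here we address the issue of how the known presence of random instances of a known motif affects the power of detecting patterns, such as patterns within the motif. Let N_W(n) be the number of possibly overlapping occurrences of a pattern W in the sequence that contains instances of a known motif; such a sequence is modeled here by a Hidden Markov Model (HMM). First, efficient computational methods for calculating the mean and variance of N_W(n) are developed. Second, efficient computational methods for calculating parameters involved in the normal approximation of N_W(n) for frequent patterns and compound Poisson approximation of N_W(n) for rare patterns are developed. Third, an easy-to-use web program is developed to calculate the power of detecting patterns, and the program is used to study the power of detection in several interesting biological examples.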
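The central quantity in the abstract, the number of possibly overlapping occurrences of a fixed word W in a sequence of length n, can be illustrated with a minimal Python sketch. The function names `count_overlapping` and `expected_count_iid` are hypothetical, and the expected count below assumes independent, identically distributed letters. That is a deliberate simplification for illustration and is not the HMM-based calculation developed in the article.

```python
from math import prod
from typing import Mapping


def count_overlapping(seq: str, word: str) -> int:
    """Count occurrences of `word` in `seq`, allowing overlaps."""
    if not word:
        raise ValueError("word must be non-empty")
    last_start = len(seq) - len(word)
    # Check every start position; str.count would skip overlapping matches.
    return sum(1 for i in range(last_start + 1) if seq.startswith(word, i))


def expected_count_iid(n: int, word: str, probs: Mapping[str, float]) -> float:
    """Expected overlapping count in a length-n sequence of i.i.d. letters.

    Assumption for illustration only: letters are independent with the
    given probabilities (no hidden states, no embedded motif instances).
    """
    positions = max(n - len(word) + 1, 0)
    return positions * prod(probs[c] for c in word)


if __name__ == "__main__":
    uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
    print(count_overlapping("AAAA", "AA"))       # 3 (starts at 0, 1, 2)
    print(expected_count_iid(10, "AC", uniform)) # 9 * (1/16) = 0.5625
```

Comparing an observed count with its expectation under a background model is the basic idea behind calling a pattern over- or under-represented; the article replaces the simple background used here with an HMM and studies the power of such tests.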

Bibliographic details
Published in:	Journal of computational biology 2010-04, Vol.17 (4), p.581-592
Main authors:	Zhai, Zhiyuan; Ku, Shih-Yen; Luan, Yihui; Reinert, Gesine; Waterman, Michael S; Sun, Fengzhu
Format:	Article
Language:	eng
Subjects:	Approximation; Base Composition - genetics; Base Sequence; Binding sites; Biological; Biology; Computation; Computational biology; CpG Islands - genetics; Genetic regulation; Hidden Markov models; Internet; Markov Chains; Mathematical analysis; Mathematical models; Numerical Analysis, Computer-Assisted; Pattern Recognition, Automated - methods; Physiological aspects; Poisson Distribution; Sequence Analysis, DNA - methods; Transcription factors; Variance
Online access:	Full text

container_end_page	592
container_issue	4
container_start_page	581
container_title	Journal of computational biology
container_volume	17
creator	Zhai, Zhiyuan; Ku, Shih-Yen; Luan, Yihui; Reinert, Gesine; Waterman, Michael S; Sun, Fengzhu
description	The identification of binding sites of transcription factors (TF) and other regulatory regions, referred to as motifs, located in a set of molecular sequences is of fundamental importance in genomic research. Many computational and experimental approaches have been developed to locate motifs. The set of sequences of interest can be concatenated to form a long sequence of length n. One of the successful approaches for motif discovery is to identify statistically over- or under-represented patterns in this long sequence. A pattern refers to a fixed word W over the alphabet. In the example of interest, W is a word in the set of patterns of the motif. Despite extensive studies on motif discovery, no studies have been carried out on the power of detecting statistically over- or under-represented patterns. Here we address the issue of how the known presence of random instances of a known motif affects the power of detecting patterns, such as patterns within the motif. Let N_W(n) be the number of possibly overlapping occurrences of a pattern W in the sequence that contains instances of a known motif; such a sequence is modeled here by a Hidden Markov Model (HMM). First, efficient computational methods for calculating the mean and variance of N_W(n) are developed. Second, efficient computational methods for calculating parameters involved in the normal approximation of N_W(n) for frequent patterns and compound Poisson approximation of N_W(n) for rare patterns are developed. Third, an easy-to-use web program is developed to calculate the power of detecting patterns, and the program is used to study the power of detection in several interesting biological examples.
doi_str_mv	10.1089/cmb.2009.0218
format	Article
eissn	1557-8666
pmid	20426691
publisher	United States: Mary Ann Liebert, Inc
rights	COPYRIGHT 2010 Mary Ann Liebert, Inc.
oa	free_for_read
fulltext	fulltext
identifier	ISSN: 1066-5277
ispartof	Journal of computational biology, 2010-04, Vol.17 (4), p.581-592
issn	1066-5277 1557-8666
language	eng
recordid	cdi_pubmedcentral_primary_oai_pubmedcentral_nih_gov_3203519
source	Mary Ann Liebert Online Subscription; MEDLINE; Alma/SFX Local Collection
subjects	Approximation; Base Composition - genetics; Base Sequence; Binding sites; Biological; Biology; Computation; Computational biology; CpG Islands - genetics; Genetic regulation; Hidden Markov models; Internet; Markov Chains; Mathematical analysis; Mathematical models; Numerical Analysis, Computer-Assisted; Pattern Recognition, Automated - methods; Physiological aspects; Poisson Distribution; Sequence Analysis, DNA - methods; Transcription factors; Variance
title	The power of detecting enriched patterns: an HMM approach
url	https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2024-12-12T05%3A40%3A15IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-gale_pubme&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=The%20power%20of%20detecting%20enriched%20patterns:%20an%20HMM%20approach&rft.jtitle=Journal%20of%20computational%20biology&rft.au=Zhai,%20Zhiyuan&rft.date=2010-04&rft.volume=17&rft.issue=4&rft.spage=581&rft.epage=592&rft.pages=581-592&rft.issn=1066-5277&rft.eissn=1557-8666&rft_id=info:doi/10.1089/cmb.2009.0218&rft_dat=%3Cgale_pubme%3EA226162798%3C/gale_pubme%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=1323812241&rft_id=info:pmid/20426691&rft_galeid=A226162798&rfr_iscdi=true