The power of detecting enriched patterns: an HMM approach
The identification of binding sites of transcription factors (TF) and other regulatory regions, referred to as motifs, located in a set of molecular sequences is of fundamental importance in genomic research. Many computational and experimental approaches have been developed to locate motifs. The se...
Published in: | Journal of computational biology 2010-04, Vol.17 (4), p.581-592 |
---|---|
Main authors: | Zhai, Zhiyuan; Ku, Shih-Yen; Luan, Yihui; Reinert, Gesine; Waterman, Michael S; Sun, Fengzhu |
Format: | Article |
Language: | eng |
Subjects: | |
Online access: | Full text |
container_end_page | 592 |
---|---|
container_issue | 4 |
container_start_page | 581 |
container_title | Journal of computational biology |
container_volume | 17 |
creator | Zhai, Zhiyuan; Ku, Shih-Yen; Luan, Yihui; Reinert, Gesine; Waterman, Michael S; Sun, Fengzhu |
description | The identification of binding sites of transcription factors (TF) and other regulatory regions, referred to as motifs, located in a set of molecular sequences is of fundamental importance in genomic research. Many computational and experimental approaches have been developed to locate motifs. The set of sequences of interest can be concatenated to form a long sequence of length n. One of the successful approaches for motif discovery is to identify statistically over- or under-represented patterns in this long sequence. A pattern refers to a fixed word W over the alphabet. In the example of interest, W is a word in the set of patterns of the motif. Despite extensive studies on motif discovery, no studies have been carried out on the power of detecting statistically over- or under-represented patterns. Here we address the issue of how the known presence of random instances of a known motif affects the power of detecting patterns, such as patterns within the motif. Let N_W(n) be the number of possibly overlapping occurrences of a pattern W in the sequence that contains instances of a known motif; such a sequence is modeled here by a Hidden Markov Model (HMM). First, efficient computational methods for calculating the mean and variance of N_W(n) are developed. Second, efficient computational methods for calculating parameters involved in the normal approximation of N_W(n) for frequent patterns and compound Poisson approximation of N_W(n) for rare patterns are developed. Third, an easy-to-use web program is developed to calculate the power of detecting patterns and the program is used to study the power of detection in several interesting biological examples. |
doi_str_mv | 10.1089/cmb.2009.0218 |
format | Article |
fullrecord | <record><control><sourceid>gale_pubme</sourceid><recordid>TN_cdi_pubmedcentral_primary_oai_pubmedcentral_nih_gov_3203519</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><galeid>A226162798</galeid><sourcerecordid>A226162798</sourcerecordid><originalsourceid>FETCH-LOGICAL-c519t-e4cea77d0b55eed9d7451588274d8fb725c39e28fdf301d9958d01149c8816273</originalsourceid><addsrcrecordid>eNqFkstrFTEUxoNU7EOXbmWgC93MNY_JqwuhFNsKLW7qOuQmZ-6NzCTTZK7if2-G2xYLomSRkPzOd04-PoTeErwiWOmPblyvKMZ6hSlRL9AR4Vy2SghxUM9YiJZTKQ_RcSnfMSZMYPkKHVLcUSE0OUL6bgvNlH5CblLfeJjBzSFuGog5uC34ZrLzDDmWs8bG5vr2trHTlJN129foZW-HAm8e9hP07fLz3cV1e_P16svF-U3rONFzC50DK6XHa84BvPay44QrRWXnVb-WlDumgare9wwTrzVXHhPSaacUEVSyE_Rprzvt1iN4B3HOdjBTDqPNv0yywTx_iWFrNumHYRSzOkIVeP8gkNP9DspsxlAcDIONkHbFSMY0rW1FJT_8kySSs2qvYuz_KKNMEUo7UtHTPbqxA5gQ-1THdAtuzikVyye1qtTqL1RdHsbgUoQ-1PtnBe2-wOVUSob-yRKCzRINU6NhlmiYJRqVf_enj0_0YxbYb6mPsUo</addsrcrecordid><sourcetype>Open Access Repository</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>1323812241</pqid></control><display><type>article</type><title>The power of detecting enriched patterns: an HMM approach</title><source>Mary Ann Liebert Online Subscription</source><source>MEDLINE</source><source>Alma/SFX Local Collection</source><creator>Zhai, Zhiyuan ; Ku, Shih-Yen ; Luan, Yihui ; Reinert, Gesine ; Waterman, Michael S ; Sun, Fengzhu</creator><creatorcontrib>Zhai, Zhiyuan ; Ku, Shih-Yen ; Luan, Yihui ; Reinert, Gesine ; Waterman, Michael S ; Sun, Fengzhu</creatorcontrib><description>The identification of binding sites of transcription factors (TF) and other regulatory regions, referred to as motifs, located in a set of molecular sequences is of fundamental importance in genomic research. Many computational and experimental approaches have been developed to locate motifs. The set of sequences of interest can be concatenated to form a long sequence of length n. 
One of the successful approaches for motif discovery is to identify statistically over- or under-represented patterns in this long sequence. A pattern refers to a fixed word W over the alphabet. In the example of interest, W is a word in the set of patterns of the motif. Despite extensive studies on motif discovery, no studies have been carried out on the power of detecting statistically over- or under-represented patterns Here we address the issue of how the known presence of random instances of a known motif affects the power of detecting patterns, such as patterns within the motif. Let N(W)(n) be the number of possibly overlapping occurrences of a pattern W in the sequence that contains instances of a known motif; such a sequence is modeled here by a Hidden Markov Model (HMM). First, efficient computational methods for calculating the mean and variance of N(W)(n) are developed. Second, efficient computational methods for calculating parameters involved in the normal approximation of N(W)(n) for frequent patterns and compound Poisson approximation of N(W)(n) for rare patterns are developed. 
Third, an easy to use web program is developed to calculate the power of detecting patterns and the program is used to study the power of detection in several interesting biological examples.</description><identifier>ISSN: 1066-5277</identifier><identifier>EISSN: 1557-8666</identifier><identifier>DOI: 10.1089/cmb.2009.0218</identifier><identifier>PMID: 20426691</identifier><language>eng</language><publisher>United States: Mary Ann Liebert, Inc</publisher><subject>Approximation ; Base Composition - genetics ; Base Sequence ; Binding sites ; Biological ; Biology ; Computation ; Computational biology ; CpG Islands - genetics ; Genetic regulation ; Hidden Markov models ; Internet ; Markov Chains ; Mathematical analysis ; Mathematical models ; Numerical Analysis, Computer-Assisted ; Pattern Recognition, Automated - methods ; Physiological aspects ; Poisson Distribution ; Sequence Analysis, DNA - methods ; Transcription factors ; Variance</subject><ispartof>Journal of computational biology, 2010-04, Vol.17 (4), p.581-592</ispartof><rights>COPYRIGHT 2010 Mary Ann Liebert, Inc.</rights><rights>Copyright 2010, Mary Ann Liebert, Inc. 
2010</rights><lds50>peer_reviewed</lds50><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed><citedby>FETCH-LOGICAL-c519t-e4cea77d0b55eed9d7451588274d8fb725c39e28fdf301d9958d01149c8816273</citedby><cites>FETCH-LOGICAL-c519t-e4cea77d0b55eed9d7451588274d8fb725c39e28fdf301d9958d01149c8816273</cites></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><link.rule.ids>230,315,781,785,886,3043,27929,27930</link.rule.ids><backlink>$$Uhttps://www.ncbi.nlm.nih.gov/pubmed/20426691$$D View this record in MEDLINE/PubMed$$Hfree_for_read</backlink></links><search><creatorcontrib>Zhai, Zhiyuan</creatorcontrib><creatorcontrib>Ku, Shih-Yen</creatorcontrib><creatorcontrib>Luan, Yihui</creatorcontrib><creatorcontrib>Reinert, Gesine</creatorcontrib><creatorcontrib>Waterman, Michael S</creatorcontrib><creatorcontrib>Sun, Fengzhu</creatorcontrib><title>The power of detecting enriched patterns: an HMM approach</title><title>Journal of computational biology</title><addtitle>J Comput Biol</addtitle><description>The identification of binding sites of transcription factors (TF) and other regulatory regions, referred to as motifs, located in a set of molecular sequences is of fundamental importance in genomic research. Many computational and experimental approaches have been developed to locate motifs. The set of sequences of interest can be concatenated to form a long sequence of length n. One of the successful approaches for motif discovery is to identify statistically over- or under-represented patterns in this long sequence. A pattern refers to a fixed word W over the alphabet. In the example of interest, W is a word in the set of patterns of the motif. 
Despite extensive studies on motif discovery, no studies have been carried out on the power of detecting statistically over- or under-represented patterns Here we address the issue of how the known presence of random instances of a known motif affects the power of detecting patterns, such as patterns within the motif. Let N(W)(n) be the number of possibly overlapping occurrences of a pattern W in the sequence that contains instances of a known motif; such a sequence is modeled here by a Hidden Markov Model (HMM). First, efficient computational methods for calculating the mean and variance of N(W)(n) are developed. Second, efficient computational methods for calculating parameters involved in the normal approximation of N(W)(n) for frequent patterns and compound Poisson approximation of N(W)(n) for rare patterns are developed. Third, an easy to use web program is developed to calculate the power of detecting patterns and the program is used to study the power of detection in several interesting biological examples.</description><subject>Approximation</subject><subject>Base Composition - genetics</subject><subject>Base Sequence</subject><subject>Binding sites</subject><subject>Biological</subject><subject>Biology</subject><subject>Computation</subject><subject>Computational biology</subject><subject>CpG Islands - genetics</subject><subject>Genetic regulation</subject><subject>Hidden Markov models</subject><subject>Internet</subject><subject>Markov Chains</subject><subject>Mathematical analysis</subject><subject>Mathematical models</subject><subject>Numerical Analysis, Computer-Assisted</subject><subject>Pattern Recognition, Automated - methods</subject><subject>Physiological aspects</subject><subject>Poisson Distribution</subject><subject>Sequence Analysis, DNA - methods</subject><subject>Transcription 
factors</subject><subject>Variance</subject><issn>1066-5277</issn><issn>1557-8666</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2010</creationdate><recordtype>article</recordtype><sourceid>EIF</sourceid><recordid>eNqFkstrFTEUxoNU7EOXbmWgC93MNY_JqwuhFNsKLW7qOuQmZ-6NzCTTZK7if2-G2xYLomSRkPzOd04-PoTeErwiWOmPblyvKMZ6hSlRL9AR4Vy2SghxUM9YiJZTKQ_RcSnfMSZMYPkKHVLcUSE0OUL6bgvNlH5CblLfeJjBzSFuGog5uC34ZrLzDDmWs8bG5vr2trHTlJN129foZW-HAm8e9hP07fLz3cV1e_P16svF-U3rONFzC50DK6XHa84BvPay44QrRWXnVb-WlDumgare9wwTrzVXHhPSaacUEVSyE_Rprzvt1iN4B3HOdjBTDqPNv0yywTx_iWFrNumHYRSzOkIVeP8gkNP9DspsxlAcDIONkHbFSMY0rW1FJT_8kySSs2qvYuz_KKNMEUo7UtHTPbqxA5gQ-1THdAtuzikVyye1qtTqL1RdHsbgUoQ-1PtnBe2-wOVUSob-yRKCzRINU6NhlmiYJRqVf_enj0_0YxbYb6mPsUo</recordid><startdate>201004</startdate><enddate>201004</enddate><creator>Zhai, Zhiyuan</creator><creator>Ku, Shih-Yen</creator><creator>Luan, Yihui</creator><creator>Reinert, Gesine</creator><creator>Waterman, Michael S</creator><creator>Sun, Fengzhu</creator><general>Mary Ann Liebert, Inc</general><scope>CGR</scope><scope>CUY</scope><scope>CVF</scope><scope>ECM</scope><scope>EIF</scope><scope>NPM</scope><scope>AAYXX</scope><scope>CITATION</scope><scope>7QO</scope><scope>8FD</scope><scope>FR3</scope><scope>P64</scope><scope>7SC</scope><scope>JQ2</scope><scope>L7M</scope><scope>L~C</scope><scope>L~D</scope><scope>7X8</scope><scope>5PM</scope></search><sort><creationdate>201004</creationdate><title>The power of detecting enriched patterns: an HMM approach</title><author>Zhai, Zhiyuan ; Ku, Shih-Yen ; Luan, Yihui ; Reinert, Gesine ; Waterman, Michael S ; Sun, Fengzhu</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-c519t-e4cea77d0b55eed9d7451588274d8fb725c39e28fdf301d9958d01149c8816273</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2010</creationdate><topic>Approximation</topic><topic>Base Composition - genetics</topic><topic>Base 
Sequence</topic><topic>Binding sites</topic><topic>Biological</topic><topic>Biology</topic><topic>Computation</topic><topic>Computational biology</topic><topic>CpG Islands - genetics</topic><topic>Genetic regulation</topic><topic>Hidden Markov models</topic><topic>Internet</topic><topic>Markov Chains</topic><topic>Mathematical analysis</topic><topic>Mathematical models</topic><topic>Numerical Analysis, Computer-Assisted</topic><topic>Pattern Recognition, Automated - methods</topic><topic>Physiological aspects</topic><topic>Poisson Distribution</topic><topic>Sequence Analysis, DNA - methods</topic><topic>Transcription factors</topic><topic>Variance</topic><toplevel>peer_reviewed</toplevel><toplevel>online_resources</toplevel><creatorcontrib>Zhai, Zhiyuan</creatorcontrib><creatorcontrib>Ku, Shih-Yen</creatorcontrib><creatorcontrib>Luan, Yihui</creatorcontrib><creatorcontrib>Reinert, Gesine</creatorcontrib><creatorcontrib>Waterman, Michael S</creatorcontrib><creatorcontrib>Sun, Fengzhu</creatorcontrib><collection>Medline</collection><collection>MEDLINE</collection><collection>MEDLINE (Ovid)</collection><collection>MEDLINE</collection><collection>MEDLINE</collection><collection>PubMed</collection><collection>CrossRef</collection><collection>Biotechnology Research Abstracts</collection><collection>Technology Research Database</collection><collection>Engineering Research Database</collection><collection>Biotechnology and BioEngineering Abstracts</collection><collection>Computer and Information Systems Abstracts</collection><collection>ProQuest Computer Science Collection</collection><collection>Advanced Technologies Database with Aerospace</collection><collection>Computer and Information Systems Abstracts Academic</collection><collection>Computer and Information Systems Abstracts Professional</collection><collection>MEDLINE - Academic</collection><collection>PubMed Central (Full Participant titles)</collection><jtitle>Journal of computational 
biology</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Zhai, Zhiyuan</au><au>Ku, Shih-Yen</au><au>Luan, Yihui</au><au>Reinert, Gesine</au><au>Waterman, Michael S</au><au>Sun, Fengzhu</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>The power of detecting enriched patterns: an HMM approach</atitle><jtitle>Journal of computational biology</jtitle><addtitle>J Comput Biol</addtitle><date>2010-04</date><risdate>2010</risdate><volume>17</volume><issue>4</issue><spage>581</spage><epage>592</epage><pages>581-592</pages><issn>1066-5277</issn><eissn>1557-8666</eissn><abstract>The identification of binding sites of transcription factors (TF) and other regulatory regions, referred to as motifs, located in a set of molecular sequences is of fundamental importance in genomic research. Many computational and experimental approaches have been developed to locate motifs. The set of sequences of interest can be concatenated to form a long sequence of length n. One of the successful approaches for motif discovery is to identify statistically over- or under-represented patterns in this long sequence. A pattern refers to a fixed word W over the alphabet. In the example of interest, W is a word in the set of patterns of the motif. Despite extensive studies on motif discovery, no studies have been carried out on the power of detecting statistically over- or under-represented patterns Here we address the issue of how the known presence of random instances of a known motif affects the power of detecting patterns, such as patterns within the motif. Let N(W)(n) be the number of possibly overlapping occurrences of a pattern W in the sequence that contains instances of a known motif; such a sequence is modeled here by a Hidden Markov Model (HMM). First, efficient computational methods for calculating the mean and variance of N(W)(n) are developed. 
Second, efficient computational methods for calculating parameters involved in the normal approximation of N(W)(n) for frequent patterns and compound Poisson approximation of N(W)(n) for rare patterns are developed. Third, an easy to use web program is developed to calculate the power of detecting patterns and the program is used to study the power of detection in several interesting biological examples.</abstract><cop>United States</cop><pub>Mary Ann Liebert, Inc</pub><pmid>20426691</pmid><doi>10.1089/cmb.2009.0218</doi><tpages>12</tpages><oa>free_for_read</oa></addata></record> |
fulltext | fulltext |
identifier | ISSN: 1066-5277 |
ispartof | Journal of computational biology, 2010-04, Vol.17 (4), p.581-592 |
issn | 1066-5277; 1557-8666 |
language | eng |
recordid | cdi_pubmedcentral_primary_oai_pubmedcentral_nih_gov_3203519 |
source | Mary Ann Liebert Online Subscription; MEDLINE; Alma/SFX Local Collection |
subjects | Approximation; Base Composition - genetics; Base Sequence; Binding sites; Biological; Biology; Computation; Computational biology; CpG Islands - genetics; Genetic regulation; Hidden Markov models; Internet; Markov Chains; Mathematical analysis; Mathematical models; Numerical Analysis, Computer-Assisted; Pattern Recognition, Automated - methods; Physiological aspects; Poisson Distribution; Sequence Analysis, DNA - methods; Transcription factors; Variance |
title | The power of detecting enriched patterns: an HMM approach |
url | https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2024-12-12T05%3A40%3A15IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-gale_pubme&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=The%20power%20of%20detecting%20enriched%20patterns:%20an%20HMM%20approach&rft.jtitle=Journal%20of%20computational%20biology&rft.au=Zhai,%20Zhiyuan&rft.date=2010-04&rft.volume=17&rft.issue=4&rft.spage=581&rft.epage=592&rft.pages=581-592&rft.issn=1066-5277&rft.eissn=1557-8666&rft_id=info:doi/10.1089/cmb.2009.0218&rft_dat=%3Cgale_pubme%3EA226162798%3C/gale_pubme%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=1323812241&rft_id=info:pmid/20426691&rft_galeid=A226162798&rfr_iscdi=true |
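
The description field defines the count of possibly overlapping occurrences of a word W in a sequence of length n. As a minimal sketch of what that statistic measures, the Python below counts overlapping occurrences and computes the expected count under a much simpler background than the article uses: independent, identically distributed letters rather than an HMM. The function names `count_overlapping` and `expected_count_iid` and the uniform letter probabilities are illustrative assumptions, not taken from the article or its web program.

```python
from math import prod
from typing import Mapping


def count_overlapping(sequence: str, word: str) -> int:
    """Count occurrences of `word` in `sequence`, allowing overlaps."""
    if not word:
        raise ValueError("word must be non-empty")
    k = len(word)
    # Slide a window of length k over every start position.
    return sum(1 for i in range(len(sequence) - k + 1) if sequence[i:i + k] == word)


def expected_count_iid(n: int, word: str, letter_probs: Mapping[str, float]) -> float:
    """Expected overlapping count in a length-n sequence of i.i.d. letters.

    Assumption: letters are independent, which is NOT the HMM of the article.
    There are n - k + 1 start positions, each matching with probability
    equal to the product of the letter probabilities of `word`.
    """
    k = len(word)
    if n < k:
        return 0.0
    return (n - k + 1) * prod(letter_probs[c] for c in word)


if __name__ == "__main__":
    uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
    print(count_overlapping("AAAA", "AA"))
    print(expected_count_iid(10, "ACG", uniform))
```

Under the i.i.d. assumption the mean follows from linearity of expectation alone; the variance is harder because overlapping start positions are correlated, which is one reason the article develops dedicated methods for it.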