Improved search heuristics find 20,000 new alignments between human and mouse genomes

Sequence similarity search is a fundamental way of analyzing nucleotide sequences. Despite decades of research, this is not a solved problem because there exist many similarities that are not found by current methods. Search methods are typically based on a seed-and-extend approach, which has many v...

Bibliographic details
Published in:	Nucleic acids research 2014-04, Vol.42 (7), p.e59-e59
Main authors:	Frith, Martin C; Noé, Laurent
Format:	Article
Language:	eng
Subjects:	Animals; Bioinformatics; Computer Science; Dogs; Genome; Genome, Human; Genomics - methods; Humans; Life Sciences; Methods Online; Mice; Quantitative Methods; Sequence Alignment - methods; Sequence Analysis, DNA - methods
Online access:	Full text

container_end_page	e59
container_issue	7
container_start_page	e59
container_title	Nucleic acids research
container_volume	42
creator	Frith, Martin C; Noé, Laurent
description	Sequence similarity search is a fundamental way of analyzing nucleotide sequences. Despite decades of research, this is not a solved problem because there exist many similarities that are not found by current methods. Search methods are typically based on a seed-and-extend approach, which has many variants (e.g. spaced seeds, transition seeds), and it remains unclear how to optimize this approach. This study designs and tests seeding methods for inter-mammal and inter-insect genome comparison. By considering substitution patterns of real genomes, we design sets of multiple complementary transition seeds, which have better performance (sensitivity per run time) than previous seeding strategies. Often the best seed patterns have more transition positions than those used previously. We also point out that recent computer memory sizes (e.g. 60 GB) make it feasible to use multiple (e.g. eight) seeds for whole mammal genomes. Interestingly, the most sensitive settings achieve diminishing returns for human-dog and melanogaster-pseudoobscura comparisons, but not for human-mouse, which suggests that we still miss many human-mouse alignments. Our optimized heuristics find ∼20,000 new human-mouse alignments that are missing from the standard UCSC alignments. We tabulate seed patterns and parameters that work well so they can be used in future research.
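
The description above refers to spaced seeds and transition seeds without showing what a seed pattern does. The following is a minimal sketch of the general idea, not the authors' implementation: the pattern alphabet (`1` = bases must match, `0` = any pair allowed, `T` = match or transition allowed), the function names, and the brute-force scan are all assumptions made for illustration. Real alignment tools index the genome rather than comparing every pair of positions.

```python
# Illustrative sketch only: the pattern symbols and function names below are
# assumptions for this example, not taken from the article or from any tool.

# A transition swaps two purines (A<->G) or two pyrimidines (C<->T).
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def is_transition(a, b):
    """Return True if a and b differ and belong to the same base class."""
    return a != b and ({a, b} <= PURINES or {a, b} <= PYRIMIDINES)


def seed_hit(pattern, x, y):
    """Return True if equal-length words x and y satisfy the seed pattern.

    '1' positions require identical bases, 'T' positions accept identical
    bases or a transition, and '0' positions accept any pair of bases.
    """
    if not (len(pattern) == len(x) == len(y)):
        raise ValueError("pattern and both words must have the same length")
    for p, a, b in zip(pattern, x, y):
        if p == "1" and a != b:
            return False
        if p == "T" and a != b and not is_transition(a, b):
            return False
    return True


def find_hits(pattern, query, target):
    """Brute-force scan: list every (query_pos, target_pos) seed hit."""
    k = len(pattern)
    hits = []
    for i in range(len(query) - k + 1):
        for j in range(len(target) - k + 1):
            if seed_hit(pattern, query[i:i + k], target[j:j + k]):
                hits.append((i, j))
    return hits


if __name__ == "__main__":
    # C/T at the 'T' position is a transition, G/A at the '0' position is ignored.
    print(seed_hit("1T01", "ACGT", "ATAT"))
    # C/A at the 'T' position is a transversion, so the seed rejects it.
    print(seed_hit("1T01", "ACGT", "AAGT"))
    print(find_hits("1T1", "ACG", "TATGA"))
```

In a seed-and-extend search, each hit would then be extended into a full alignment; using several complementary patterns amounts to accepting a position pair if any one of the patterns reports a hit.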
doi_str_mv	10.1093/nar/gku104
format	Article
creationdate	2014-04-01
eissn	1362-4962
pmid	24493737
publisher	England: Oxford University Press
rights	Distributed under a Creative Commons Attribution 4.0 International License. The Author(s) 2014. Published by Oxford University Press.
oa	free_for_read
toplevel	peer_reviewed; online_resources
orcid	https://orcid.org/0000-0002-1170-8376
linktohtml	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3985675/
linktopdf	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3985675/pdf/
backlink	https://www.ncbi.nlm.nih.gov/pubmed/24493737 (View this record in MEDLINE/PubMed)
backlink	https://inria.hal.science/hal-00958207 (View record in HAL)
collection	MEDLINE; MEDLINE (Ovid); PubMed; CrossRef; MEDLINE - Academic; Hyper Article en Ligne (HAL); PubMed Central (Full Participant titles)
fulltext	fulltext
identifier	ISSN: 0305-1048
ispartof	Nucleic acids research, 2014-04, Vol.42 (7), p.e59-e59
issn	0305-1048 1362-4962
language	eng
recordid	cdi_pubmedcentral_primary_oai_pubmedcentral_nih_gov_3985675
source	Oxford Journals Open Access Collection; MEDLINE; DOAJ Directory of Open Access Journals; PubMed Central; Free Full-Text Journals in Chemistry
subjects	Animals; Bioinformatics; Computer Science; Dogs; Genome; Genome, Human; Genomics - methods; Humans; Life Sciences; Methods Online; Mice; Quantitative Methods; Sequence Alignment - methods; Sequence Analysis, DNA - methods
title	Improved search heuristics find 20,000 new alignments between human and mouse genomes
url	https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-02-07T21%3A27%3A07IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_pubme&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Improved%20search%20heuristics%20find%2020,000%20new%20alignments%20between%20human%20and%20mouse%20genomes&rft.jtitle=Nucleic%20acids%20research&rft.au=Frith,%20Martin%20C&rft.date=2014-04-01&rft.volume=42&rft.issue=7&rft.spage=e59&rft.epage=e59&rft.pages=e59-e59&rft.issn=0305-1048&rft.eissn=1362-4962&rft_id=info:doi/10.1093/nar/gku104&rft_dat=%3Cproquest_pubme%3E1516722178%3C/proquest_pubme%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=1516722178&rft_id=info:pmid/24493737&rfr_iscdi=true