Benchmarking the empirical accuracy of short-read sequencing across the M. tuberculosis genome

Abstract Motivation Short-read whole-genome sequencing (WGS) is a vital tool for clinical applications and basic research. Genetic divergence from the reference genome, repetitive sequences and sequencing bias reduce the performance of variant calling using short-read alignment, but the loss in rec...

Bibliographic details
Published in: Bioinformatics 2022-03, Vol.38 (7), p.1781-1787
Main authors: Marin, Maximillian; Vargas, Roger; Harris, Michael; Jeffrey, Brendan; Epperson, L Elaine; Durbin, David; Strong, Michael; Salfinger, Max; Iqbal, Zamin; Akhundova, Irada; Vashakidze, Sergo; Crudu, Valeriu; Rosenthal, Alex; Farhat, Maha Reda
Format: Article
Language: eng
container_end_page 1787
container_issue 7
container_start_page 1781
container_title Bioinformatics
container_volume 38
creator Marin, Maximillian
Vargas, Roger
Harris, Michael
Jeffrey, Brendan
Epperson, L Elaine
Durbin, David
Strong, Michael
Salfinger, Max
Iqbal, Zamin
Akhundova, Irada
Vashakidze, Sergo
Crudu, Valeriu
Rosenthal, Alex
Farhat, Maha Reda
description Abstract Motivation Short-read whole-genome sequencing (WGS) is a vital tool for clinical applications and basic research. Genetic divergence from the reference genome, repetitive sequences and sequencing bias reduce the performance of variant calling using short-read alignment, but the loss in recall and specificity has not been adequately characterized. To benchmark short-read variant calling, we used 36 diverse clinical Mycobacterium tuberculosis (Mtb) isolates dually sequenced with Illumina short-reads and PacBio long-reads. We systematically studied the short-read variant calling accuracy and the influence of sequence uniqueness, reference bias and GC content. Results Reference-based Illumina variant calling demonstrated a maximum recall of 89.0% and minimum precision of 98.5% across parameters evaluated. The approach that maximized variant recall while still maintaining high precision (
doi_str_mv 10.1093/bioinformatics/btac023
format Article
fullrecord <record><control><sourceid>proquest_TOX</sourceid><recordid>TN_cdi_pubmedcentral_primary_oai_pubmedcentral_nih_gov_8963317</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><oup_id>10.1093/bioinformatics/btac023</oup_id><sourcerecordid>2619545860</sourcerecordid><originalsourceid>FETCH-LOGICAL-c461t-df38a4a24b397a22a7acce0887861673a6701cc50fa46178f5eff981703fe0173</originalsourceid><addsrcrecordid>eNqNkU9v1DAQxS0Eon_gK1Q5ckk7jh3buSBBVShSEZf2ijXxjncNSbzYSaV--7q7S0VvnMbS_N4bPz3Gzjicc-jERR9imHxMI87B5Yt-RgeNeMWOuVRQN9B2r8tbKF1LA-KIneT8C6DlUsq37Ei00IDuxDH7-Zkmtxkx_Q7Tupo3VNG4DSk4HCp0bknoHqroq7yJaa4T4arK9GcpoiceXYo572Tfz6t56Sm5ZYg55GpNUxzpHXvjccj0_jBP2d2Xq9vL6_rmx9dvl59uaicVn-uVFwYlNrIXncamQV1uExijjeJKC1QauHMteCy8Nr4l7zvDNQhPwLU4ZR_3vtulH2nlaJoTDnabQon2YCMG-3IzhY1dx3trOiXEzuDDwSDFEi_PdgzZ0TDgRHHJtlG8a2VrFBRU7dFd-ET--QwH-1SOfVmOPZRThGf_fvJZ9reNAvA9EJft_5o-AuO3pOY</addsrcrecordid><sourcetype>Open Access Repository</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>2619545860</pqid></control><display><type>article</type><title>Benchmarking the empirical accuracy of short-read sequencing across the M. tuberculosis genome</title><source>Oxford Journals Open Access Collection</source><creator>Marin, Maximillian ; Vargas, Roger ; Harris, Michael ; Jeffrey, Brendan ; Epperson, L Elaine ; Durbin, David ; Strong, Michael ; Salfinger, Max ; Iqbal, Zamin ; Akhundova, Irada ; Vashakidze, Sergo ; Crudu, Valeriu ; Rosenthal, Alex ; Farhat, Maha Reda</creator><creatorcontrib>Marin, Maximillian ; Vargas, Roger ; Harris, Michael ; Jeffrey, Brendan ; Epperson, L Elaine ; Durbin, David ; Strong, Michael ; Salfinger, Max ; Iqbal, Zamin ; Akhundova, Irada ; Vashakidze, Sergo ; Crudu, Valeriu ; Rosenthal, Alex ; Farhat, Maha Reda</creatorcontrib><description>Abstract Motivation Short-read whole-genome sequencing (WGS) is a vital tool for clinical applications and basic research. 
Genetic divergence from the reference genome, repetitive sequences and sequencing bias reduce the performance of variant calling using short-read alignment, but the loss in recall and specificity has not been adequately characterized. To benchmark short-read variant calling, we used 36 diverse clinical Mycobacterium tuberculosis (Mtb) isolates dually sequenced with Illumina short-reads and PacBio long-reads. We systematically studied the short-read variant calling accuracy and the influence of sequence uniqueness, reference bias and GC content. Results Reference-based Illumina variant calling demonstrated a maximum recall of 89.0% and minimum precision of 98.5% across parameters evaluated. The approach that maximized variant recall while still maintaining high precision (&gt;99%) was tuning the mapping quality filtering threshold, i.e. confidence of the read mapping (recall = 85.8%, precision = 99.1%, MQ ≥ 40). Additional masking of repetitive sequence content is an alternative conservative approach to variant calling that increases precision at cost to recall (recall = 70.2%, precision = 99.6%, MQ ≥ 40). Of the genomic positions typically excluded for Mtb, 68% are accurately called using Illumina WGS including 52/168 PE/PPE genes (34.5%). From these results, we present a refined list of low confidence regions across the Mtb genome, which we found to frequently overlap with regions with structural variation, low sequence uniqueness and low sequencing coverage. Our benchmarking results have broad implications for the use of WGS in the study of Mtb biology, inference of transmission in public health surveillance systems and more generally for WGS applications in other organisms. Availability and implementation All relevant code is available at https://github.com/farhat-lab/mtb-illumina-wgs-evaluation. 
Supplementary information Supplementary data are available at Bioinformatics online.</description><identifier>ISSN: 1367-4803</identifier><identifier>EISSN: 1460-2059</identifier><identifier>EISSN: 1367-4811</identifier><identifier>DOI: 10.1093/bioinformatics/btac023</identifier><identifier>PMID: 35020793</identifier><language>eng</language><publisher>England: Oxford University Press</publisher><subject>Benchmarking ; High-Throughput Nucleotide Sequencing - methods ; Humans ; Mycobacterium tuberculosis - genetics ; Original Papers ; Sequence Analysis, DNA - methods ; Software ; Tuberculosis</subject><ispartof>Bioinformatics, 2022-03, Vol.38 (7), p.1781-1787</ispartof><rights>The Author(s) 2022. Published by Oxford University Press. All rights reserved. For permissions, please e-mail: journals.permissions@oup.com 2022</rights><rights>The Author(s) 2022. Published by Oxford University Press. All rights reserved. For permissions, please e-mail: journals.permissions@oup.com.</rights><lds50>peer_reviewed</lds50><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed><citedby>FETCH-LOGICAL-c461t-df38a4a24b397a22a7acce0887861673a6701cc50fa46178f5eff981703fe0173</citedby><cites>FETCH-LOGICAL-c461t-df38a4a24b397a22a7acce0887861673a6701cc50fa46178f5eff981703fe0173</cites><orcidid>0000-0001-5059-8002 ; 
0000-0002-9108-3328</orcidid></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><linktopdf>$$Uhttps://www.ncbi.nlm.nih.gov/pmc/articles/PMC8963317/pdf/$$EPDF$$P50$$Gpubmedcentral$$H</linktopdf><linktohtml>$$Uhttps://www.ncbi.nlm.nih.gov/pmc/articles/PMC8963317/$$EHTML$$P50$$Gpubmedcentral$$H</linktohtml><link.rule.ids>230,314,723,776,780,881,1598,27901,27902,53766,53768</link.rule.ids><linktorsrc>$$Uhttps://dx.doi.org/10.1093/bioinformatics/btac023$$EView_record_in_Oxford_University_Press$$FView_record_in_$$GOxford_University_Press</linktorsrc><backlink>$$Uhttps://www.ncbi.nlm.nih.gov/pubmed/35020793$$D View this record in MEDLINE/PubMed$$Hfree_for_read</backlink></links><search><creatorcontrib>Marin, Maximillian</creatorcontrib><creatorcontrib>Vargas, Roger</creatorcontrib><creatorcontrib>Harris, Michael</creatorcontrib><creatorcontrib>Jeffrey, Brendan</creatorcontrib><creatorcontrib>Epperson, L Elaine</creatorcontrib><creatorcontrib>Durbin, David</creatorcontrib><creatorcontrib>Strong, Michael</creatorcontrib><creatorcontrib>Salfinger, Max</creatorcontrib><creatorcontrib>Iqbal, Zamin</creatorcontrib><creatorcontrib>Akhundova, Irada</creatorcontrib><creatorcontrib>Vashakidze, Sergo</creatorcontrib><creatorcontrib>Crudu, Valeriu</creatorcontrib><creatorcontrib>Rosenthal, Alex</creatorcontrib><creatorcontrib>Farhat, Maha Reda</creatorcontrib><title>Benchmarking the empirical accuracy of short-read sequencing across the M. tuberculosis genome</title><title>Bioinformatics</title><addtitle>Bioinformatics</addtitle><description>Abstract Motivation Short-read whole-genome sequencing (WGS) is a vital tool for clinical applications and basic research. 
Genetic divergence from the reference genome, repetitive sequences and sequencing bias reduce the performance of variant calling using short-read alignment, but the loss in recall and specificity has not been adequately characterized. To benchmark short-read variant calling, we used 36 diverse clinical Mycobacterium tuberculosis (Mtb) isolates dually sequenced with Illumina short-reads and PacBio long-reads. We systematically studied the short-read variant calling accuracy and the influence of sequence uniqueness, reference bias and GC content. Results Reference-based Illumina variant calling demonstrated a maximum recall of 89.0% and minimum precision of 98.5% across parameters evaluated. The approach that maximized variant recall while still maintaining high precision (&gt;99%) was tuning the mapping quality filtering threshold, i.e. confidence of the read mapping (recall = 85.8%, precision = 99.1%, MQ ≥ 40). Additional masking of repetitive sequence content is an alternative conservative approach to variant calling that increases precision at cost to recall (recall = 70.2%, precision = 99.6%, MQ ≥ 40). Of the genomic positions typically excluded for Mtb, 68% are accurately called using Illumina WGS including 52/168 PE/PPE genes (34.5%). From these results, we present a refined list of low confidence regions across the Mtb genome, which we found to frequently overlap with regions with structural variation, low sequence uniqueness and low sequencing coverage. Our benchmarking results have broad implications for the use of WGS in the study of Mtb biology, inference of transmission in public health surveillance systems and more generally for WGS applications in other organisms. Availability and implementation All relevant code is available at https://github.com/farhat-lab/mtb-illumina-wgs-evaluation. 
Supplementary information Supplementary data are available at Bioinformatics online.</description><subject>Benchmarking</subject><subject>High-Throughput Nucleotide Sequencing - methods</subject><subject>Humans</subject><subject>Mycobacterium tuberculosis - genetics</subject><subject>Original Papers</subject><subject>Sequence Analysis, DNA - methods</subject><subject>Software</subject><subject>Tuberculosis</subject><issn>1367-4803</issn><issn>1460-2059</issn><issn>1367-4811</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2022</creationdate><recordtype>article</recordtype><sourceid>EIF</sourceid><recordid>eNqNkU9v1DAQxS0Eon_gK1Q5ckk7jh3buSBBVShSEZf2ijXxjncNSbzYSaV--7q7S0VvnMbS_N4bPz3Gzjicc-jERR9imHxMI87B5Yt-RgeNeMWOuVRQN9B2r8tbKF1LA-KIneT8C6DlUsq37Ei00IDuxDH7-Zkmtxkx_Q7Tupo3VNG4DSk4HCp0bknoHqroq7yJaa4T4arK9GcpoiceXYo572Tfz6t56Sm5ZYg55GpNUxzpHXvjccj0_jBP2d2Xq9vL6_rmx9dvl59uaicVn-uVFwYlNrIXncamQV1uExijjeJKC1QauHMteCy8Nr4l7zvDNQhPwLU4ZR_3vtulH2nlaJoTDnabQon2YCMG-3IzhY1dx3trOiXEzuDDwSDFEi_PdgzZ0TDgRHHJtlG8a2VrFBRU7dFd-ET--QwH-1SOfVmOPZRThGf_fvJZ9reNAvA9EJft_5o-AuO3pOY</recordid><startdate>20220328</startdate><enddate>20220328</enddate><creator>Marin, Maximillian</creator><creator>Vargas, Roger</creator><creator>Harris, Michael</creator><creator>Jeffrey, Brendan</creator><creator>Epperson, L Elaine</creator><creator>Durbin, David</creator><creator>Strong, Michael</creator><creator>Salfinger, Max</creator><creator>Iqbal, Zamin</creator><creator>Akhundova, Irada</creator><creator>Vashakidze, Sergo</creator><creator>Crudu, Valeriu</creator><creator>Rosenthal, Alex</creator><creator>Farhat, Maha Reda</creator><general>Oxford University 
Press</general><scope>CGR</scope><scope>CUY</scope><scope>CVF</scope><scope>ECM</scope><scope>EIF</scope><scope>NPM</scope><scope>AAYXX</scope><scope>CITATION</scope><scope>7X8</scope><scope>5PM</scope><orcidid>https://orcid.org/0000-0001-5059-8002</orcidid><orcidid>https://orcid.org/0000-0002-9108-3328</orcidid></search><sort><creationdate>20220328</creationdate><title>Benchmarking the empirical accuracy of short-read sequencing across the M. tuberculosis genome</title><author>Marin, Maximillian ; Vargas, Roger ; Harris, Michael ; Jeffrey, Brendan ; Epperson, L Elaine ; Durbin, David ; Strong, Michael ; Salfinger, Max ; Iqbal, Zamin ; Akhundova, Irada ; Vashakidze, Sergo ; Crudu, Valeriu ; Rosenthal, Alex ; Farhat, Maha Reda</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-c461t-df38a4a24b397a22a7acce0887861673a6701cc50fa46178f5eff981703fe0173</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2022</creationdate><topic>Benchmarking</topic><topic>High-Throughput Nucleotide Sequencing - methods</topic><topic>Humans</topic><topic>Mycobacterium tuberculosis - genetics</topic><topic>Original Papers</topic><topic>Sequence Analysis, DNA - methods</topic><topic>Software</topic><topic>Tuberculosis</topic><toplevel>peer_reviewed</toplevel><toplevel>online_resources</toplevel><creatorcontrib>Marin, Maximillian</creatorcontrib><creatorcontrib>Vargas, Roger</creatorcontrib><creatorcontrib>Harris, Michael</creatorcontrib><creatorcontrib>Jeffrey, Brendan</creatorcontrib><creatorcontrib>Epperson, L Elaine</creatorcontrib><creatorcontrib>Durbin, David</creatorcontrib><creatorcontrib>Strong, Michael</creatorcontrib><creatorcontrib>Salfinger, Max</creatorcontrib><creatorcontrib>Iqbal, Zamin</creatorcontrib><creatorcontrib>Akhundova, Irada</creatorcontrib><creatorcontrib>Vashakidze, Sergo</creatorcontrib><creatorcontrib>Crudu, Valeriu</creatorcontrib><creatorcontrib>Rosenthal, 
Alex</creatorcontrib><creatorcontrib>Farhat, Maha Reda</creatorcontrib><collection>Medline</collection><collection>MEDLINE</collection><collection>MEDLINE (Ovid)</collection><collection>MEDLINE</collection><collection>MEDLINE</collection><collection>PubMed</collection><collection>CrossRef</collection><collection>MEDLINE - Academic</collection><collection>PubMed Central (Full Participant titles)</collection><jtitle>Bioinformatics</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext_linktorsrc</fulltext></delivery><addata><au>Marin, Maximillian</au><au>Vargas, Roger</au><au>Harris, Michael</au><au>Jeffrey, Brendan</au><au>Epperson, L Elaine</au><au>Durbin, David</au><au>Strong, Michael</au><au>Salfinger, Max</au><au>Iqbal, Zamin</au><au>Akhundova, Irada</au><au>Vashakidze, Sergo</au><au>Crudu, Valeriu</au><au>Rosenthal, Alex</au><au>Farhat, Maha Reda</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>Benchmarking the empirical accuracy of short-read sequencing across the M. tuberculosis genome</atitle><jtitle>Bioinformatics</jtitle><addtitle>Bioinformatics</addtitle><date>2022-03-28</date><risdate>2022</risdate><volume>38</volume><issue>7</issue><spage>1781</spage><epage>1787</epage><pages>1781-1787</pages><issn>1367-4803</issn><eissn>1460-2059</eissn><eissn>1367-4811</eissn><abstract>Abstract Motivation Short-read whole-genome sequencing (WGS) is a vital tool for clinical applications and basic research. Genetic divergence from the reference genome, repetitive sequences and sequencing bias reduces the performance of variant calling using short-read alignment, but the loss in recall and specificity has not been adequately characterized. To benchmark short-read variant calling, we used 36 diverse clinical Mycobacterium tuberculosis (Mtb) isolates dually sequenced with Illumina short-reads and PacBio long-reads. 
We systematically studied the short-read variant calling accuracy and the influence of sequence uniqueness, reference bias and GC content. Results Reference-based Illumina variant calling demonstrated a maximum recall of 89.0% and minimum precision of 98.5% across parameters evaluated. The approach that maximized variant recall while still maintaining high precision (&gt;99%) was tuning the mapping quality filtering threshold, i.e. confidence of the read mapping (recall = 85.8%, precision = 99.1%, MQ ≥ 40). Additional masking of repetitive sequence content is an alternative conservative approach to variant calling that increases precision at cost to recall (recall = 70.2%, precision = 99.6%, MQ ≥ 40). Of the genomic positions typically excluded for Mtb, 68% are accurately called using Illumina WGS including 52/168 PE/PPE genes (34.5%). From these results, we present a refined list of low confidence regions across the Mtb genome, which we found to frequently overlap with regions with structural variation, low sequence uniqueness and low sequencing coverage. Our benchmarking results have broad implications for the use of WGS in the study of Mtb biology, inference of transmission in public health surveillance systems and more generally for WGS applications in other organisms. Availability and implementation All relevant code is available at https://github.com/farhat-lab/mtb-illumina-wgs-evaluation. Supplementary information Supplementary data are available at Bioinformatics online.</abstract><cop>England</cop><pub>Oxford University Press</pub><pmid>35020793</pmid><doi>10.1093/bioinformatics/btac023</doi><tpages>7</tpages><orcidid>https://orcid.org/0000-0001-5059-8002</orcidid><orcidid>https://orcid.org/0000-0002-9108-3328</orcidid><oa>free_for_read</oa></addata></record>
fulltext fulltext_linktorsrc
identifier ISSN: 1367-4803
ispartof Bioinformatics, 2022-03, Vol.38 (7), p.1781-1787
issn 1367-4803
1460-2059
1367-4811
language eng
recordid cdi_pubmedcentral_primary_oai_pubmedcentral_nih_gov_8963317
source Oxford Journals Open Access Collection
subjects Benchmarking
High-Throughput Nucleotide Sequencing - methods
Humans
Mycobacterium tuberculosis - genetics
Original Papers
Sequence Analysis, DNA - methods
Software
Tuberculosis
title Benchmarking the empirical accuracy of short-read sequencing across the M. tuberculosis genome
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-02-13T04%3A39%3A42IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_TOX&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Benchmarking%20the%20empirical%20accuracy%20of%20short-read%20sequencing%20across%20the%20M.%20tuberculosis%20genome&rft.jtitle=Bioinformatics&rft.au=Marin,%20Maximillian&rft.date=2022-03-28&rft.volume=38&rft.issue=7&rft.spage=1781&rft.epage=1787&rft.pages=1781-1787&rft.issn=1367-4803&rft.eissn=1460-2059&rft_id=info:doi/10.1093/bioinformatics/btac023&rft_dat=%3Cproquest_TOX%3E2619545860%3C/proquest_TOX%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2619545860&rft_id=info:pmid/35020793&rft_oup_id=10.1093/bioinformatics/btac023&rfr_iscdi=true