Error-Correcting Codes for Short Tandem Duplication and Edit Errors

Due to its high data density and longevity, DNA is considered a promising medium for satisfying ever-increasing data storage needs. However, the diversity of errors that occur in DNA sequences makes efficient error-correction a challenging task. This paper aims to address simultaneously correcting t...

Bibliographic details
Published in: IEEE Transactions on Information Theory, 2022-02, Vol. 68 (2), pp. 871-880
Main authors: Tang, Yuanyuan; Farnoud, Farzad
Format: Article
Language: English
container_end_page 880
container_issue 2
container_start_page 871
container_title IEEE transactions on information theory
container_volume 68
creator Tang, Yuanyuan
Farnoud, Farzad
description Due to its high data density and longevity, DNA is considered a promising medium for satisfying ever-increasing data storage needs. However, the diversity of errors that occur in DNA sequences makes efficient error-correction a challenging task. This paper aims to address simultaneously correcting two types of errors, namely, short tandem duplication and edit errors, where an edit error may be a substitution, deletion, or insertion. We focus on tandem repeats of length at most 3 and design codes for correcting an arbitrary number of duplication errors and one edit error. Because an edited symbol can be duplicated many times (as part of substrings of various lengths), a single edit can affect an unbounded substring of the retrieved word. However, we show that with appropriate preprocessing, the effect may be limited to a substring of finite length, thus making efficient error-correction possible. We construct a code for correcting the aforementioned errors and provide lower bounds for its rate. Compared to optimal codes correcting only duplication errors, numerical results show that the asymptotic cost of protecting against an additional edit is only 0.003 bits/symbol when the alphabet has size 4, an important case corresponding to data storage in DNA.
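To make the error model in the description concrete, here is a minimal Python sketch of the two error types it names: a tandem duplication of length at most 3 and a single edit (shown here as a substitution). The function names `tandem_duplicate` and `substitute` and the example word are hypothetical illustrations, not taken from the paper, and the sketch does not implement the paper's code construction or its preprocessing step. It only shows how an edited symbol can be copied by later duplications.

```python
def tandem_duplicate(word: str, start: int, length: int) -> str:
    """Insert a copy of word[start:start+length] directly after itself."""
    if not 1 <= length <= 3:
        # Assumption: mirror the description's restriction to repeats of length <= 3.
        raise ValueError("this sketch only models duplications of length at most 3")
    if start < 0 or start + length > len(word):
        raise ValueError("duplicated substring must lie inside the word")
    segment = word[start:start + length]
    return word[:start + length] + segment + word[start + length:]


def substitute(word: str, position: int, symbol: str) -> str:
    """Apply a single substitution, one of the three edit types."""
    if not 0 <= position < len(word):
        raise ValueError("position must lie inside the word")
    return word[:position] + symbol + word[position + 1:]


stored = "ACGT"                          # hypothetical stored word over {A, C, G, T}
step1 = tandem_duplicate(stored, 1, 2)   # duplicate "CG"
step2 = substitute(step1, 3, "T")        # one edit: replace the symbol at index 3
step3 = tandem_duplicate(step2, 3, 1)    # the edited symbol itself is duplicated
print(step1, step2, step3)
```

Each further duplication that covers the edited symbol lengthens the affected region, which is the unbounded-substring effect the description mentions.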
doi_str_mv 10.1109/TIT.2021.3125724
format Article
coden IETTAW
date 2022-02-01
eissn 1557-9654
ieee_id 9605683
linktohtml https://ieeexplore.ieee.org/document/9605683
oa free_for_read
orcid 0000-0003-2946-7782
0000-0002-8684-4487
publisher New York: IEEE
rights Copyright The Institute of Electrical and Electronics Engineers, Inc. (IEEE) 2022
toplevel peer_reviewed
online_resources
tpages 10
fulltext fulltext_linktorsrc
identifier ISSN: 0018-9448
ispartof IEEE transactions on information theory, 2022-02, Vol.68 (2), p.871-880
issn 0018-9448
1557-9654
language eng
recordid cdi_proquest_journals_2621792950
source IEEE Electronic Library (IEL)
subjects Codes
Data storage
DNA
DNA data storage
duplication errors
edit errors
Error correcting codes
Error correction
Error correction codes
Gene sequencing
Lower bounds
Media
Memory
Noise measurement
Reproduction (copying)
Sequential analysis
Task analysis
title Error-Correcting Codes for Short Tandem Duplication and Edit Errors
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-31T22%3A29%3A41IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_RIE&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Error-Correcting%20Codes%20for%20Short%20Tandem%20Duplication%20and%20Edit%20Errors&rft.jtitle=IEEE%20transactions%20on%20information%20theory&rft.au=Tang,%20Yuanyuan&rft.date=2022-02-01&rft.volume=68&rft.issue=2&rft.spage=871&rft.epage=880&rft.pages=871-880&rft.issn=0018-9448&rft.eissn=1557-9654&rft.coden=IETTAW&rft_id=info:doi/10.1109/TIT.2021.3125724&rft_dat=%3Cproquest_RIE%3E2621792950%3C/proquest_RIE%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2621792950&rft_id=info:pmid/&rft_ieee_id=9605683&rfr_iscdi=true