Computing Maximal Covers for Protein Sequences
Published in: | Journal of Computational Biology, 2023-02, Vol. 30 (2), p. 149-160 |
---|---|
Main authors: | , , , |
Format: | Article |
Language: | English |
Summary: | A partial cover of a string or sequence of length n, which we model as an array x = x[1..n], is a repeating substring u of x such that "many" positions in x lie within occurrences of u. A maximal cover u* (introduced in 2018 by Mhaskar and Smyth as optimal cover) is a partial cover that, over all partial covers u, maximizes the number of positions covered. Applying data structures also introduced by Mhaskar and Smyth, our software MAXCOVER for the first time enables efficient computation of u* for any x, in particular, as described here, for protein sequences of Arabidopsis, Caenorhabditis elegans, Drosophila melanogaster, and humans. In this protein context, we also compare an extended version of MAXCOVER with existing software (MUMmer's repeat-match) for the closely related task of computing non-extendible repeating substrings (a.k.a. maximal repeats). In practice, MAXCOVER is an order of magnitude faster than MUMmer, with much lower space requirements, while producing more compact output that, nevertheless, yields a more exact and user-friendly specification of the repeats. |
ISSN: | 1557-8666 |
DOI: | 10.1089/cmb.2021.0520 |
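
The definitions in the summary can be made concrete with a small brute-force sketch. It is not the MAXCOVER algorithm and uses none of the Mhaskar and Smyth data structures; it simply tries every repeating substring u of x and counts the positions its occurrences cover. The function names `covered_positions` and `naive_maximal_cover` are hypothetical, the code uses Python's 0-based indexing rather than x[1..n], and the tie-breaking rule (keep the first, shortest candidate found) is an assumption, since the record does not state one.

```python
from typing import Optional, Tuple


def covered_positions(x: str, u: str) -> Tuple[int, int]:
    """Return (occurrences of u in x, positions of x covered by those occurrences)."""
    occurrences = 0
    covered = 0
    covered_end = 0  # every covered position so far lies strictly left of this index
    start = x.find(u)
    while start != -1:
        occurrences += 1
        end = start + len(u)
        # Count only the part of this occurrence not already covered.
        covered += end - max(start, covered_end)
        covered_end = end
        start = x.find(u, start + 1)  # step by one so overlapping occurrences count
    return occurrences, covered


def naive_maximal_cover(x: str) -> Optional[Tuple[str, int]]:
    """Brute-force search for a repeating substring covering the most positions.

    Returns (u, positions covered), or None if no substring occurs twice.
    Assumption: ties are broken in favour of the first (shortest) candidate.
    """
    best: Optional[Tuple[str, int]] = None
    seen = set()
    n = len(x)
    for length in range(1, n):  # a substring of length n cannot repeat
        for i in range(n - length + 1):
            u = x[i:i + length]
            if u in seen:
                continue
            seen.add(u)
            occurrences, covered = covered_positions(x, u)
            if occurrences >= 2 and (best is None or covered > best[1]):
                best = (u, covered)
    return best
```

For example, `naive_maximal_cover("abacaba")` returns `("aba", 6)`: the two occurrences of `aba` cover every position except the central `c`. The search examines a quadratic number of candidate substrings, so it is useful only for illustrating the definition on short strings, not for protein-scale inputs.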