Computing Maximal Covers for Protein Sequences


Bibliographic details
Published in: Journal of Computational Biology, 2023-02, Vol. 30 (2), pp. 149-160
Main authors: Golding, G. Brian; Koponen, Holly; Mhaskar, Neerja; Smyth, W. F.
Format: Article
Language: English
Description
Abstract: A partial cover of a string or sequence of length n, which we model as an array x = x[1..n], is a repeating substring u of x such that "many" positions in x lie within occurrences of u. A maximal cover u* (introduced in 2018 by Mhaskar and Smyth as optimal cover) is a partial cover that, over all partial covers u, maximizes the number of positions covered. Applying data structures also introduced by Mhaskar and Smyth, our software MAXCOVER for the first time enables efficient computation of u* for any x, in particular, as described here, for protein sequences of Arabidopsis, Caenorhabditis elegans, Drosophila melanogaster, and humans. In this protein context, we also compare an extended version of MAXCOVER with existing software (MUMmer's repeat-match) for the closely related task of computing non-extendible repeating substrings (a.k.a. maximal repeats). In practice, MAXCOVER is an order of magnitude faster than MUMmer, with much lower space requirements, while producing more compact output that nevertheless yields a more exact and user-friendly specification of the repeats.
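To make the definition concrete, the following is a minimal brute-force sketch of a maximal cover. It is not the MAXCOVER algorithm and does not use the Mhaskar and Smyth data structures; it simply tries every substring, so it is only practical for very short strings. The function names `covered_positions` and `naive_maximal_cover` are hypothetical, indices are 0-based rather than the 1-based x[1..n] of the abstract, and the tie-breaking rule (keep the first candidate found, scanning by increasing length) is an assumption of this sketch, not something the abstract specifies.

```python
def covered_positions(x: str, u: str) -> int:
    """Count positions of x that lie within some occurrence of u.

    Occurrences may overlap. A substring that occurs fewer than twice
    is not "repeating", so it is treated as covering nothing.
    """
    covered = [False] * len(x)
    occurrences = 0
    start = x.find(u)
    while start != -1:
        occurrences += 1
        for i in range(start, start + len(u)):
            covered[i] = True
        # Advance by one position so overlapping occurrences are found.
        start = x.find(u, start + 1)
    return sum(covered) if occurrences >= 2 else 0


def naive_maximal_cover(x: str) -> tuple[str, int]:
    """Return (u, count) for a repeating substring u covering the most positions.

    Assumption: ties are broken in favour of the first candidate found
    when scanning by increasing length, then by leftmost start.
    Returns ("", 0) if x has no repeating substring.
    """
    best_u, best_count = "", 0
    seen = set()
    # A substring of length len(x) occurs only once, so stop before it.
    for length in range(1, len(x)):
        for i in range(len(x) - length + 1):
            u = x[i:i + length]
            if u in seen:
                continue
            seen.add(u)
            count = covered_positions(x, u)
            if count > best_count:
                best_u, best_count = u, count
    return best_u, best_count
```

For example, in `"abcab"` the substring `"ab"` occurs twice and covers four of the five positions, which no other repeating substring matches, so `naive_maximal_cover("abcab")` returns `("ab", 4)`.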
ISSN: 1557-8666
DOI: 10.1089/cmb.2021.0520