Computing Maximal Covers for Protein Sequences

Bibliographic Details
Published in:	Journal of Computational Biology, 2023-02, Vol. 30 (2), p. 149-160
Main Authors:	Golding, G Brian; Koponen, Holly; Mhaskar, Neerja; Smyth, W F
Format:	Article
Language:	English
Subjects:	Algorithms; Amino Acid Sequence; Animals; Drosophila melanogaster; Humans; MatBio 2021 Special Section; Proteins; Software

Description
Summary:	A partial cover of a string or sequence of length n, which we model as an array x = x[1..n], is a repeating substring u of x such that “many” positions in x lie within occurrences of u. A maximal cover u* (introduced in 2018 by Mhaskar and Smyth as an optimal cover) is a partial cover that, over all partial covers u, maximizes the number of positions covered. Applying data structures also introduced by Mhaskar and Smyth, our software MAXCOVER for the first time enables efficient computation of u* for any x; in particular, as described here, for protein sequences of Arabidopsis, Caenorhabditis elegans, Drosophila melanogaster, and humans. In this protein context, we also compare an extended version of MAXCOVER with existing software (MUMmer's repeat-match) for the closely related task of computing non-extendible repeating substrings (a.k.a. maximal repeats). In practice, MAXCOVER is an order of magnitude faster than MUMmer, with much lower space requirements, while producing more compact output that, nevertheless, yields a more exact and user-friendly specification of the repeats.
ISSN:	1557-8666
DOI:	10.1089/cmb.2021.0520
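
To make the definition in the Summary concrete, the following is a minimal brute-force sketch of what a maximal cover is. It is not the MAXCOVER algorithm and does not use the data structures of Mhaskar and Smyth; it simply tries every repeating substring and counts the positions it covers. The function names `covered_positions` and `maximal_cover_naive` are hypothetical, indices are 0-based rather than the 1-based x[1..n] of the record, and the tie-breaking rule (prefer the longer substring, then the lexicographically smaller one) is an assumption made here only so that the result is deterministic.

```python
from typing import Optional, Tuple


def covered_positions(x: str, u: str) -> int:
    """Count positions of x that lie inside at least one occurrence of u."""
    covered = [False] * len(x)
    start = x.find(u)
    while start != -1:
        for i in range(start, start + len(u)):
            covered[i] = True
        # Advance by one so that overlapping occurrences are also found.
        start = x.find(u, start + 1)
    return sum(covered)


def maximal_cover_naive(x: str) -> Optional[Tuple[str, int]]:
    """Return (u, positions covered) for a maximal cover of x, or None.

    Brute force over all distinct substrings; only suitable for short inputs.
    Tie-breaking (longer first, then lexicographically smaller) is an
    assumption of this sketch, not taken from the paper.
    """
    best_key = None
    seen = set()
    n = len(x)
    for i in range(n):
        for j in range(i + 1, n + 1):
            u = x[i:j]
            if u in seen:
                continue
            seen.add(u)
            # A partial cover must repeat: at least two occurrences
            # (overlapping occurrences are allowed).
            if x.find(u, x.find(u) + 1) == -1:
                continue
            key = (-covered_positions(x, u), -len(u), u)
            if best_key is None or key < best_key:
                best_key = key
    if best_key is None:
        return None  # x has no repeating substring
    return best_key[2], -best_key[0]


if __name__ == "__main__":
    print(maximal_cover_naive("abacabad"))
```

For the string "abacabad", the substring "aba" occurs at positions 0 and 4 and covers six of the eight positions, which no other repeating substring matches. The sketch takes time polynomial in n but far from efficient, which is why it is only an illustration of the definition.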