The WM-q multiple exact string matching algorithm for DNA sequences

The string matching algorithms are among the essential fields in computer science, such as text search, intrusion detection systems, fraud detection, sequence search in bioinformatics. The exact string matching algorithms are divided into two parts: single and multiple. Multiple string matching algo...

Ausführliche Beschreibung

Gespeichert in:
Bibliographische Detailangaben
Veröffentlicht in:Computers in biology and medicine 2021-09, Vol.136, p.104656-104656, Article 104656
Hauptverfasser: Karcioglu, Abdullah Ammar, Bulut, Hasan
Format: Artikel
Sprache:eng
Schlagworte:
Online-Zugang:Volltext
Tags: Tag hinzufügen
Keine Tags, Fügen Sie den ersten Tag hinzu!
Beschreibung
Zusammenfassung:The string matching algorithms are among the essential fields in computer science, such as text search, intrusion detection systems, fraud detection, sequence search in bioinformatics. The exact string matching algorithms are divided into two parts: single and multiple. Multiple string matching algorithms involve finding elements of the pattern set P in a given input text T. String matching processes should be done in a time-efficient manner for DNA sequences. As the volume of the text T increases and the number of search patterns increases, the total runtime increases. Efficient algorithms should be selected to perform these search operations as soon as possible. In this study, the Wu-Manber algorithm, one of the multiple exact string matching algorithms, is improved. Although the Wu-Manber algorithm is effective, it has some limitations, such as hash collisions. In this study, the WM-q algorithm, a version of the Wu-Manber algorithm based on the perfect hash function for DNA sequences, is proposed. String matching is performed using different block lengths provided by the perfect hash function instead of using the fixed block length as in the traditional Wu-Manber algorithm. The proposed approach has been compared with E. Coli and Human Chromosome1 datasets, frequently used in the literature, using multiple exact string matching algorithms. The proposed algorithm gives better results for performance metrics such as the average runtime, the average number of characters and hash comparisons. [Display omitted] •The perfect hash function that removes hash collisions in DNA sequences has been applied to the WM algorithm.•There is no need to make character comparisons of the q-length suffix and prefix sequences of patterns.•Larger block lengths are used in the range between 3≤q≤8 instead of using the fixed block length B=2 as in the WM algorithm.•If the prefix mismatches, the patterns up to 3≤q≤8 rather than B=2 are shifted on the input text.
ISSN:0010-4825
1879-0534
DOI:10.1016/j.compbiomed.2021.104656