Co-occurrence pattern mining based on a biological approximation scoring matrix

Bibliographic details
Published in: Pattern analysis and applications : PAA, 2018-11, Vol. 21 (4), p. 977-996
Main authors: Guo, Dan; Yuan, Ermao; Hu, Xuegang; Wu, Xindong
Format: Article
Language: English
Description
Abstract: Mining co-occurrence frequency patterns from multiple sequences is a hot topic in bioinformatics. Many seemingly disorganized constituents appear repeatedly under different biological scoring matrices, such as PAM250 and BLOSUM62; these are considered hidden frequent patterns (FPs). A hidden FP that allows both gaps and flexible approximation operations (replacement, deletion or insertion) makes its true occurrences harder to discover. To effectively discover co-occurrence FPs (Co-FPs) under these conditions, we design a mining algorithm (co-fp-miner) using the following steps: (1) a biological approximation scoring matrix is designed to discover various deformations of a single FP; (2) a data-driven intersection tactic is used to generate candidate Co-FPs; (3) a deterministic Apriori-like rule is proposed to prune unnecessary Co-FPs; and (4) finally, a backtracking matching scheme is employed to validate true Co-FPs. The co-fp-miner algorithm is a unified framework for both exact and approximate mining on multiple sequences. Experiments on DNA and protein sequences demonstrate that co-fp-miner is more efficient than peer algorithms in terms of solutions, time and memory consumption.
ISSN: 1433-7541, 1433-755X
DOI: 10.1007/s10044-017-0609-8
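
Illustrative sketch: the four steps listed in the abstract can be pictured with a small toy example. The code below is not the authors' co-fp-miner. It is a minimal sketch under stated assumptions: the scoring table is a made-up stand-in for a matrix such as PAM250 or BLOSUM62, only replacement is modelled (no deletion or insertion), and all names (SIMILAR, score, occurs, supporters, co_frequent_pairs) and parameters (max_gap, min_score, min_sup) are hypothetical.

```python
from itertools import combinations

# Toy substitution scores (assumption): NOT real PAM250/BLOSUM62 values.
SIMILAR = {frozenset("IL"), frozenset("DE"), frozenset("KR")}


def score(a, b):
    """Score for aligning pattern residue a with sequence residue b."""
    if a == b:
        return 2
    return 1 if frozenset((a, b)) in SIMILAR else -1


def occurs(pattern, seq, max_gap, min_score):
    """Backtracking check for one approximate, gapped occurrence.

    True if the (non-empty) pattern can be laid over seq with at most
    max_gap skipped residues between consecutive pattern positions and a
    total substitution score of at least min_score.
    """
    def extend(i, pos, total):
        # i: next pattern index to place; pos: sequence index of the last match
        if i == len(pattern):
            return total >= min_score
        # Prune: even perfect matches (2 each) cannot reach min_score.
        if total + 2 * (len(pattern) - i) < min_score:
            return False
        for nxt in range(pos + 1, min(pos + max_gap + 2, len(seq))):
            if extend(i + 1, nxt, total + score(pattern[i], seq[nxt])):
                return True
        return False

    return any(extend(1, start, score(pattern[0], seq[start]))
               for start in range(len(seq)))


def supporters(pattern, seqs, max_gap, min_score):
    """Indices of the sequences that contain the pattern."""
    return {k for k, s in enumerate(seqs)
            if occurs(pattern, s, max_gap, min_score)}


def co_frequent_pairs(patterns, seqs, min_sup, max_gap, min_score):
    """Pairs of patterns that co-occur in at least min_sup sequences."""
    sup = {p: supporters(p, seqs, max_gap, min_score) for p in patterns}
    # Apriori-like pruning: a pair can only be frequent if both members are.
    frequent = [p for p in patterns if len(sup[p]) >= min_sup]
    result = {}
    for p, q in combinations(frequent, 2):
        shared = sup[p] & sup[q]  # intersection of supporting sequences
        if len(shared) >= min_sup:
            result[(p, q)] = shared
    return result


seqs = ["ACLDGK", "ACIDGR", "TTTTTT"]
pairs = co_frequent_pairs(["CLD", "DGK", "TTT"], seqs,
                          min_sup=2, max_gap=1, min_score=5)
print(pairs)
```

In this toy run, "CLD" matches "ACIDGR" only approximately (I replaces L), yet it still counts toward the support; "TTT" is dropped by the pruning step before any pair is formed because it appears in a single sequence.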