Co-occurrence pattern mining based on a biological approximation scoring matrix

Bibliographic details
Published in:	Pattern analysis and applications : PAA 2018-11, Vol.21 (4), p.977-996
Main authors:	Guo, Dan; Yuan, Ermao; Hu, Xuegang; Wu, Xindong
Format:	Article
Language:	English
Subjects:	Algorithms; Approximation; Bioinformatics; Computer Science; Data mining; Deformation; Deletion; Deoxyribonucleic acid; DNA; Gene sequencing; Mathematical analysis; Pattern analysis; Pattern Recognition; Proteins; Theoretical Advances
Online access:	Full text
Description
Summary:	Mining co-occurrence frequent patterns from multiple sequences is a hot topic in bioinformatics. Many seemingly disorganized constituents that appear repeatedly under different biological matrices, such as PAM250 and BLOSUM62, are considered hidden frequent patterns (FPs). A hidden FP that allows both gaps and flexible approximation operations (replacement, deletion or insertion) makes its true occurrences harder to discover. To effectively discover co-occurrence FPs (Co-FPs) under these conditions, we design a mining algorithm (co-fp-miner) with the following steps: (1) a biological approximation scoring matrix is designed to discover the various deformations of a single FP; (2) a data-driven intersection tactic is used to generate candidate Co-FPs; (3) a deterministic Apriori-like rule is proposed to prune unnecessary Co-FPs; and (4) a backtracking matching scheme is employed to validate true Co-FPs. The co-fp-miner algorithm is a unified framework for both exact and approximate mining on multiple sequences. Experiments on DNA and protein sequences demonstrate that co-fp-miner outperforms peer algorithms in terms of solutions, running time and memory consumption.
ISSN:	1433-7541 1433-755X
DOI:	10.1007/s10044-017-0609-8
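
Illustrative note: the Apriori-like pruning mentioned in step (3) of the summary can be sketched in a few lines. This is not the co-fp-miner algorithm from the article. It is a simplified, exact-matching toy that ignores gaps, scoring matrices and approximation operations, and the names mine_frequent_patterns, support, min_support and max_len are hypothetical. It only shows the general idea that a longer pattern can be frequent across sequences only if its shorter sub-patterns are frequent too.

```python
def support(pattern, sequences):
    """Count the sequences that contain the pattern as an exact substring."""
    return sum(1 for s in sequences if pattern in s)


def mine_frequent_patterns(sequences, min_support, max_len=6):
    """Level-wise mining of contiguous patterns shared by >= min_support sequences.

    Simplifying assumptions (not from the article): exact matching only,
    no gaps, no substitution scoring.
    """
    alphabet = sorted(set("".join(sequences)))
    frequent = {}
    # Level 1: single characters that are frequent.
    level = [c for c in alphabet if support(c, sequences) >= min_support]
    length = 1
    while level and length <= max_len:
        for p in level:
            frequent[p] = support(p, sequences)
        current = set(level)
        # Apriori-like pruning: extend a frequent pattern p by one character
        # only if the resulting suffix is also a frequent pattern of this level.
        candidates = [
            p + c for p in level for c in alphabet if (p + c)[1:] in current
        ]
        level = [c for c in candidates if support(c, sequences) >= min_support]
        length += 1
    return frequent


patterns = mine_frequent_patterns(["ACGTAC", "TACGTT", "GGACGA"], min_support=3)
print(sorted(patterns, key=len))
```

In this toy input, "T" is missing from the third sequence, so no pattern containing "T" is ever generated as a candidate. That is the saving the pruning rule provides.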