An efficient and accurate approach to identify similarities between biological sequences using pair amino acid composition and physicochemical properties


Bibliographic details
Published in: Soft computing (Berlin, Germany), 2024, Vol. 28 (17-18), p. 9341-9357
Main authors: Hooshyar, L., Hernández-Jiménez, M. B., Khastan, A., Vasighi, M.
Format: Article
Language: English
Description
Summary: Our study presents a novel method for analyzing biological sequences, utilizing Pairwise Amino Acid Composition and Amino Acid physicochemical properties to construct a feature vector. This step is pivotal, as by utilizing pairwise analysis, we consider the order of amino acids, thereby capturing subtle nuances in sequence structure. Simultaneously, by incorporating physicochemical properties, we ensure that the hidden information encoded within amino acids is not overlooked. Furthermore, by considering both the frequency and order of amino acid pairs, our method mitigates the risk of erroneously clustering different sequences as similar, a common pitfall in older methods. Our approach generates a concise 48-member vector, accommodating sequences of arbitrary lengths efficiently. This compact representation retains essential amino acid-specific features, enhancing the accuracy of sequence analysis. Unlike traditional approaches, our algorithm avoids the introduction of sparse vectors, ensuring the retention of important information. Additionally, we introduce fuzzy equivalence relationships to address uncertainty in the clustering process, enabling a more nuanced and flexible clustering approach that captures the inherent ambiguity in biological data. Despite these advancements, our algorithm is presented in a straightforward manner, ensuring accessibility to researchers with varying levels of computational expertise. This enhancement improves the robustness and interpretability of our method, providing researchers with a comprehensive and user-friendly tool for biological sequence analysis.
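The summary mentions clustering with fuzzy equivalence relations but does not spell out the procedure. As a rough illustration only, the sketch below shows the textbook version of that idea: take a reflexive, symmetric similarity matrix, compute its max-min transitive closure, and read clusters off a lambda-cut. The similarity values and all function names here are hypothetical, and the paper's 48-member feature vector and its actual similarity measure are not reproduced.

```python
def maxmin_compose(r, s):
    """Max-min composition of two square fuzzy relations given as nested lists."""
    n = len(r)
    return [
        [max(min(r[i][k], s[k][j]) for k in range(n)) for j in range(n)]
        for i in range(n)
    ]


def transitive_closure(relation):
    """Max-min transitive closure of a reflexive, symmetric fuzzy relation.

    Repeated squaring converges because a reflexive relation can only grow
    under composition with itself, and only existing values are ever selected.
    """
    current = [row[:] for row in relation]
    while True:
        squared = maxmin_compose(current, current)
        if squared == current:
            return current
        current = squared


def lambda_cut_clusters(equivalence, lam):
    """Group indices whose equivalence degree is at least `lam`."""
    clusters, assigned = [], set()
    for i in range(len(equivalence)):
        if i in assigned:
            continue
        members = [j for j, degree in enumerate(equivalence[i]) if degree >= lam]
        clusters.append(members)
        assigned.update(members)
    return clusters


# Hypothetical pairwise similarities between four sequences (not from the paper).
similarity = [
    [1.0, 0.8, 0.3, 0.2],
    [0.8, 1.0, 0.4, 0.2],
    [0.3, 0.4, 1.0, 0.7],
    [0.2, 0.2, 0.7, 1.0],
]

closure = transitive_closure(similarity)
print(lambda_cut_clusters(closure, 0.7))  # [[0, 1], [2, 3]]
print(lambda_cut_clusters(closure, 0.4))  # [[0, 1, 2, 3]]
```

Lowering the threshold merges clusters and raising it splits them, so a single closure matrix yields a whole hierarchy of partitions rather than one fixed grouping.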
ISSN: 1432-7643, 1433-7479
DOI: 10.1007/s00500-024-09834-5