Using statistical analysis to explore the influencing factors of data imbalance for machine learning identification methods of human transcriptome m6A modification sites

Bibliographic Details
Published in:	Computational biology and chemistry 2025-01, Vol.115, p.108351, Article 108351
Main Authors:	Li, Mingxin; Li, Rujun; Zhang, Yichi; Peng, Shiyu; Lv, Zhibin
Format:	Article
Language:	English
Subjects:	data imbalance; human transcriptome; m6A methylation; resampling; statistical test
Online Access:	Full text

Description
Summary:	RNA methylation, particularly through m6A modification, represents a crucial epigenetic mechanism that governs gene expression and influences a range of biological functions. Accurate identification of methylation sites is essential for understanding these functions. Traditional experimental methods, however, are often costly and can be influenced by experimental conditions, making machine learning, especially deep learning, a vital tool for m6A site identification. Despite their utility, current machine learning models struggle with imbalanced datasets, a common issue in bioinformatics. This study addresses the RNA methylation site data imbalance problem from three perspectives: feature encoding representation, deep learning models, and data resampling strategies. Using a K-mer one-hot encoding strategy, we extracted RNA sequence features and developed classification models based on long short-term memory (LSTM) networks and their variant, multiplicative LSTM (mLSTM). We further enhanced model performance with ensemble and weighting strategies. Additionally, we used the sequence generative adversarial network (SeqGAN) and the synthetic minority over-sampling technique (SMOTE) to construct balanced datasets of RNA methylation sites. The prediction results were analyzed using the Wilcoxon test and multivariate linear regression to explore the effects of different K-mer values, model architectures, and sampling methods on classification outcomes. The analysis showed that feature selection, model architecture, and sampling technique each have a significant impact when addressing data imbalance. Notably, the optimal prediction performance was achieved with a K value of 5 using the mLSTM-ensemble model. These findings offer new insights and methodologies for RNA methylation site identification and provide guidance for addressing similar challenges in bioinformatics.
• Key factors for solving the data imbalance problem in m6A site identification via deep learning.
• Sequence representation features with the greatest impact on models trained on imbalanced datasets.
• Multiplicative long short-term memory network ensemble model as a novel approach for m6A site identification.
• Metrics and statistical tests for comprehensively evaluating model performance on imbalanced datasets.
ISSN:	1476-9271, 1476-928X
DOI:	10.1016/j.compbiolchem.2025.108351
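
Note on K-mer one-hot encoding: the Summary names this feature representation without describing it. The following is a minimal sketch of one common reading, not the authors' implementation. It assumes overlapping K-mers taken with a stride of 1, an RNA alphabet of A, C, G and U, and one indicator vector of length 4^K per K-mer. The function names `kmer_vocabulary` and `kmer_one_hot` are hypothetical and chosen for illustration only.

```python
from itertools import product

ALPHABET = "ACGU"  # assumed RNA alphabet; DNA-style input would use T instead of U


def kmer_vocabulary(k):
    """Map every possible k-mer over ALPHABET to a column index (4**k entries)."""
    return {"".join(p): i for i, p in enumerate(product(ALPHABET, repeat=k))}


def kmer_one_hot(sequence, k):
    """Encode a sequence as one one-hot row per overlapping k-mer (stride 1).

    A k-mer containing a character outside ALPHABET (for example N)
    is encoded as an all-zero row. This handling is an assumption.
    """
    vocab = kmer_vocabulary(k)
    rows = []
    for start in range(len(sequence) - k + 1):
        row = [0] * len(vocab)
        index = vocab.get(sequence[start:start + k])
        if index is not None:
            row[index] = 1
        rows.append(row)
    return rows


# Hypothetical five-base input, used only to show the output shape.
encoded = kmer_one_hot("GGACU", k=2)
print(len(encoded), len(encoded[0]))
```

Under these assumptions a sequence of length L yields L - K + 1 rows of width 4^K, so the K value of 5 reported in the Summary would correspond to 1024 columns per position. Such a matrix could then be fed to a sequence model like an LSTM, one row per time step.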