Generating bulk RNA-Seq gene expression data based on generative deep learning models and utilizing it for data augmentation

Bibliographic details
Published in:	Computers in biology and medicine 2024-02, Vol.169, p.107828, Article 107828
Main authors:	Wang, Yinglun, Chen, Qiurui, Shao, Hongwei, Zhang, Rongxin, Shen, Han
Format:	Article
Language:	English
Keywords:	Availability; Classification; Data analysis; Data augmentation; Datasets; Deep Learning; Diffusion models; Gene expression; Generative adversarial networks; Generative learning; High-Throughput Nucleotide Sequencing; Machine learning; Medical research; Neural networks; Proteins; Reliability analysis; Reproducibility of Results; Ribonucleic acid; RNA; RNA-Seq; Sample size; Standard scores; Transcriptome; Transcriptomes

Description
Abstract:	Large-scale high-throughput transcriptome sequencing data holds significant value in biomedical research. However, practical challenges such as difficulty in sample acquisition often limit the availability of large sample sizes, leading to decreased reliability of the analysis results. In practice, generative deep learning models, such as Generative Adversarial Networks (GANs) and Diffusion Models (DMs), have been shown to generate realistic data and may be used to solve this problem. In this study, we used bulk RNA-Seq gene expression data to construct generative models of two types (GAN and DM), each with two data preprocessing methods (Min-Max and Z-Score): Min-Max-GAN, Z-Score-GAN, Min-Max-DM, and Z-Score-DM. We demonstrated that the data generated by the Min-Max-GAN model exhibited high similarity to real data, significantly surpassing the performance of the other models. Furthermore, we trained the models on the largest dataset available to date, achieving an MMD (Maximum Mean Discrepancy) of 0.030 and 0.033 on the training and independent datasets, respectively. Through SHAP (SHapley Additive exPlanations) explanations of our generative model, we also enhanced our model's credibility. Finally, we applied the generated data to data augmentation and observed a significant improvement in the performance of classification models. In summary, this study establishes a GAN-based approach for generating bulk RNA-Seq gene expression data, which contributes to enhancing the performance and reliability of downstream tasks in high-throughput transcriptome analysis.
• Demonstrating that Generative Adversarial Networks (GANs) outperform Diffusion Models (DMs) in generating bulk RNA-Seq data.
• Developing a GAN-based framework for generating bulk RNA-Seq data that accurately captures the expression of vital genes.
• Providing a feasible method for data augmentation to enhance the performance of classification models on bulk RNA-Seq data.
ISSN:	0010-4825, 1879-0534
DOI:	10.1016/j.compbiomed.2023.107828
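
The abstract names two preprocessing schemes (Min-Max and Z-Score) and reports similarity between generated and real data as MMD. As a rough illustration only, the pure-Python sketch below shows per-gene min-max and z-score scaling and a biased RBF-kernel estimate of squared MMD. The function names, the RBF kernel, the `gamma` bandwidth, and the toy data are assumptions made for this example; the record does not state which kernel, bandwidth, or estimator the authors used, so values from this sketch are not comparable to the reported 0.030 and 0.033.

```python
import math
import random


def min_max_scale(column):
    """Rescale one gene's values to [0, 1]; a constant column maps to 0.0."""
    lo, hi = min(column), max(column)
    span = hi - lo
    return [(v - lo) / span if span else 0.0 for v in column]


def z_score_scale(column):
    """Standardize one gene's values to zero mean and unit population variance."""
    mean = sum(column) / len(column)
    std = math.sqrt(sum((v - mean) ** 2 for v in column) / len(column))
    return [(v - mean) / std if std else 0.0 for v in column]


def rbf_kernel(x, y, gamma):
    """Gaussian (RBF) kernel; an assumed choice, not taken from the paper."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


def mmd_squared(xs, ys, gamma=1.0):
    """Biased (V-statistic) estimate of squared MMD between two sample sets."""
    def mean_kernel(a, b):
        total = sum(rbf_kernel(p, q, gamma) for p in a for q in b)
        return total / (len(a) * len(b))

    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)


# Toy data standing in for expression profiles (rows = samples, columns = genes).
rng = random.Random(0)
real = [[rng.gauss(0, 1) for _ in range(5)] for _ in range(50)]
similar = [[rng.gauss(0, 1) for _ in range(5)] for _ in range(50)]
shifted = [[rng.gauss(3, 1) for _ in range(5)] for _ in range(50)]

print("scaled gene:", min_max_scale([2.0, 4.0, 6.0]))
print("MMD^2 real vs similar:", mmd_squared(real, similar, gamma=0.1))
print("MMD^2 real vs shifted:", mmd_squared(real, shifted, gamma=0.1))
```

A lower value means the two sample sets are harder to tell apart under the chosen kernel, so the shifted set should score well above the similar one. In practice the bandwidth strongly affects the number, which is why MMD values are only comparable when computed with the same kernel settings.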