A novel deep contrastive convolutional autoencoder based binning approach for taxonomic independent metagenomics data

Bibliographic Details
Published in:	Journal of Plant Biochemistry and Biotechnology, 2024-12, Vol. 33 (4), p. 547-557
Main Authors:	Madival, Sharanbasappa D., Jha, Girish Kumar, Mishra, Dwijesh Chandra, Kumar, Sunil, Budhlakoti, Neeraj, Sharma, Anu, Chaturvedi, Krishna Kumar, Kabilan, S., Farooqi, Mohammad Samir, Srivastava, Sudhir
Format:	Article
Language:	English
Subjects:	Biomedical and Life Sciences; Cell Biology; Cluster analysis; Clustering; Datasets; Feature extraction; Life Sciences; Metagenomics; Natural language processing; Nucleotides; Original Article; Performance assessment; Plant Biochemistry; Protein Science; Receptors; Representations; Vector quantization
Online Access:	Full text

Description
Abstract:	In this study, we present a binning approach for metagenomics data that combines Natural Language Processing (NLP) with a Deep Contrastive Convolutional Autoencoder (DCAE). We used NLP for feature extraction, specifically tetranucleotide frequency (TNF) computed through CountVec and Term Frequency-Inverse Document Frequency (TF-IDF), with GC content added to each of the two feature matrices. The DCAE, built from convolutional layers and trained with a contrastive loss function, captures intricate patterns in the data and provides the representation used for binning. Bins are obtained by applying k-means clustering to the latent representations produced by the DCAE. To assess the performance of our method, we used three standard benchmark metagenomics datasets: the 10s, 25s, and Sharon datasets. Across all datasets, we observed Silhouette Indices exceeding 0.6 and Rand Indices surpassing 0.8. Compared to existing methodologies, our approach surpasses the Rand Index and Silhouette Index of current unsupervised methods and performs on par with semi-supervised methods across datasets. This underscores the effectiveness and versatility of our approach in metagenomics analysis.
ISSN:	0971-7811, 0974-1275
DOI:	10.1007/s13562-024-00911-2
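
Note: The feature-extraction step named in the abstract (TNF counts, a TF-IDF weighting of those counts, and an appended GC-content column) can be illustrated with the minimal sketch below. It is not the authors' code. The abstract does not say which library "CountVec" refers to, so this sketch uses only the Python standard library; the function names (`tnf_counts`, `gc_content`, `tfidf`, `feature_matrix`) and the smoothed, L2-normalised TF-IDF variant are assumptions chosen for illustration. The DCAE and k-means stages are not shown.

```python
from collections import Counter
from itertools import product
import math

K = 4  # tetranucleotides
# All 256 possible 4-mers over ACGT, in a fixed order (the feature columns).
KMERS = ["".join(p) for p in product("ACGT", repeat=K)]


def tnf_counts(seq):
    """Count overlapping 4-mers; k-mers containing non-ACGT characters are ignored."""
    seq = seq.upper()
    counts = Counter(seq[i:i + K] for i in range(len(seq) - K + 1))
    return [counts.get(kmer, 0) for kmer in KMERS]


def gc_content(seq):
    """Fraction of G and C among the unambiguous bases of the sequence."""
    seq = seq.upper()
    total = sum(seq.count(base) for base in "ACGT")
    return (seq.count("G") + seq.count("C")) / total if total else 0.0


def tfidf(count_rows):
    """Assumed variant: idf = ln((1 + n) / (1 + df)) + 1, rows scaled to unit L2 norm."""
    n = len(count_rows)
    df = [sum(1 for row in count_rows if row[j] > 0) for j in range(len(KMERS))]
    idf = [math.log((1 + n) / (1 + d)) + 1.0 for d in df]
    weighted_rows = []
    for row in count_rows:
        weighted = [count * weight for count, weight in zip(row, idf)]
        norm = math.sqrt(sum(x * x for x in weighted)) or 1.0
        weighted_rows.append([x / norm for x in weighted])
    return weighted_rows


def feature_matrix(seqs, use_tfidf=False):
    """One row per sequence: 256 TNF features followed by one GC-content column."""
    rows = [tnf_counts(s) for s in seqs]
    if use_tfidf:
        rows = tfidf(rows)
    return [row + [gc_content(s)] for row, s in zip(rows, seqs)]
```

Calling `feature_matrix(contigs)` on a list of sequence strings gives the count-based matrix, and `feature_matrix(contigs, use_tfidf=True)` gives the TF-IDF-weighted one. Either way each row has 257 columns. Whether GC content should be rescaled relative to the other columns before training is a design choice the abstract does not specify.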