Reusability report: Learning the transcriptional grammar in single-cell RNA-sequencing data using transformers

Bibliographic details
Published in:	Nature Machine Intelligence, 2023-12, Vol. 5 (12), p. 1437-1446
Main authors:	Khan, Sumeer Ahmad; Maillo, Alberto; Lagani, Vincenzo; Lehmann, Robert; Kiani, Narsis A.; Gomez-Cabrero, David; Tegner, Jesper
Format:	Article
Language:	English
Subjects:	631/114; 706/648; Accuracy; Algorithms; Annotations; Cells; Datasets; Engineering; Gene expression; Gene sequencing; Genomics; Machine learning; Medical and health sciences; Natural language; Natural language processing; Neural networks
Online access:	Full text

Description
Summary:	The rise of single-cell genomics is an attractive opportunity for data-hungry machine learning algorithms. The scBERT method, inspired by the success of BERT (‘bidirectional encoder representations from transformers’) in natural language processing, was recently introduced by Yang et al. as a data-driven tool to annotate cell types in single-cell genomics data. Analogous to contextual embedding in BERT, scBERT leverages pretraining and self-attention mechanisms to learn the ‘transcriptional grammar’ of cells. Here we investigate the reusability of scBERT beyond the original datasets, assessing the generalizability of natural language techniques in single-cell genomics. The degree of imbalance in the cell-type distribution substantially influences the performance of scBERT. Anticipating an increased utilization of transformers, we highlight the necessity to consider data distribution carefully and introduce a subsampling technique to mitigate the influence of an imbalanced distribution. Our analysis serves as a stepping stone towards understanding and optimizing the use of transformers in single-cell genomics. scBERT, a pretrained neural network for single-cell sequencing tasks, was published last year in Nature Machine Intelligence. To test the reusability of the method, Khan et al. use the code to assess the generalizability of transformer architectures on single-cell genomics tasks.
ISSN:	2522-5839
DOI:	10.1038/s42256-023-00757-8
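
Illustrative note: the summary mentions a subsampling technique for mitigating an imbalanced cell-type distribution, but this record does not describe how it works. The sketch below shows one generic approach, random downsampling of over-represented classes to a fixed cap; it should not be read as the authors' procedure. The function name `subsample_balanced`, the `max_per_class` parameter, and the toy labels are all assumptions made for illustration.

```python
import random
from collections import Counter, defaultdict


def subsample_balanced(labels, max_per_class, seed=0):
    """Return sorted indices that keep at most `max_per_class` items per label.

    Hypothetical helper: a generic majority-class downsampling, not the
    method from the paper.
    """
    rng = random.Random(seed)

    # Group the position of every cell by its cell-type label.
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    kept = []
    for indices in by_label.values():
        # Only classes above the cap are downsampled; rare classes are kept whole.
        if len(indices) > max_per_class:
            indices = rng.sample(indices, max_per_class)
        kept.extend(indices)
    return sorted(kept)


# Toy, made-up labels: one dominant cell type and two rare ones.
labels = ["T cell"] * 90 + ["B cell"] * 8 + ["NK cell"] * 2
kept = subsample_balanced(labels, max_per_class=8)
counts = Counter(labels[i] for i in kept)
print(counts)
```

Capping the majority class reduces the imbalance ratio without duplicating rare cells, at the cost of discarding data; whether that trade-off suits a given dataset is a separate question that the sketch does not address.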