HiCat: A Semi-Supervised Approach for Cell Type Annotation
Main authors: | , , , |
---|---|
Format: | Article |
Language: | English |
Subjects: | |
Summary: | We introduce HiCat (Hybrid Cell Annotation using Transformative embeddings),
a novel semi-supervised pipeline for annotating cell types from single-cell RNA
sequencing data. HiCat fuses the strengths of supervised learning for known
cell types with unsupervised learning to identify novel types. This hybrid
approach incorporates both reference and query genomic data for feature
engineering, enhancing the embedding learning process, increasing the effective
sample size for unsupervised techniques, and improving the transferability of
the supervised model trained on reference data when applied to query datasets.
The pipeline follows six key steps: (1) removing batch effects using Harmony to
generate a 50-dimensional principal component embedding; (2) applying UMAP for
dimensionality reduction to two dimensions to capture crucial data patterns;
(3) conducting unsupervised clustering of cells with DBSCAN, yielding a
one-dimensional cluster membership vector; (4) merging the multi-resolution
results of the previous steps into a 53-dimensional feature space that
encompasses both reference and query data; (5) training a CatBoost model on the
reference dataset to predict cell types in the query dataset; and (6) resolving
inconsistencies between the supervised predictions and unsupervised cluster
labels. When benchmarked on 10 publicly available genomic datasets, HiCat
surpasses other methods, particularly in differentiating and identifying
multiple new cell types. Its capacity to accurately classify novel cell types
showcases its robustness and adaptability within intricate biological datasets. |
---|---|
DOI: | 10.48550/arxiv.2412.06805 |
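
Steps (4) and (6) of the summary can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names `merge_features` and `reconcile`, the `min_conf` threshold, and the `novel_<cluster>` naming are all hypothetical. The summary does not state how inconsistencies are resolved, so the rule used here (relabel a whole cluster as novel when the mean confidence of its supervised predictions is low) is an assumption made only for illustration. The inputs stand in for the outputs of Harmony, UMAP, DBSCAN, and CatBoost, which the sketch does not call.

```python
from collections import defaultdict


def merge_features(pcs, umap_xy, cluster_ids):
    """Step 4 sketch: join per-cell features into one row per cell.

    With 50 principal components, 2 UMAP coordinates, and 1 cluster id,
    each row has 50 + 2 + 1 = 53 columns, matching the summary.
    """
    if not (len(pcs) == len(umap_xy) == len(cluster_ids)):
        raise ValueError("all inputs must describe the same cells")
    return [
        list(pc) + list(xy) + [float(cid)]
        for pc, xy, cid in zip(pcs, umap_xy, cluster_ids)
    ]


def reconcile(pred_labels, pred_conf, cluster_ids, min_conf=0.5):
    """Step 6 sketch under an ASSUMED rule (not taken from the source).

    If the mean prediction confidence inside a cluster is below min_conf,
    every cell in that cluster is relabeled "novel_<cluster id>".
    Cells with cluster id -1 (the noise label in common DBSCAN
    implementations such as scikit-learn's) keep their supervised label.
    """
    conf_sum = defaultdict(float)
    count = defaultdict(int)
    for cid, conf in zip(cluster_ids, pred_conf):
        conf_sum[cid] += conf
        count[cid] += 1

    final = []
    for label, cid in zip(pred_labels, cluster_ids):
        low_confidence = conf_sum[cid] / count[cid] < min_conf
        if cid != -1 and low_confidence:
            final.append(f"novel_{cid}")
        else:
            final.append(label)
    return final
```

In use, `merge_features` would be applied to reference and query cells together, so that the classifier trained on the reference rows sees the same 53 columns it later predicts on; `reconcile` would be applied only to the query cells.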