HiCat: A Semi-Supervised Approach for Cell Type Annotation

We introduce HiCat (Hybrid Cell Annotation using Transformative embeddings), a novel semi-supervised pipeline for annotating cell types from single-cell RNA sequencing data. HiCat fuses the strengths of supervised learning for known cell types with unsupervised learning to identify novel types. This...

Ausführliche Beschreibung

Gespeichert in:
Bibliographische Detailangaben
Hauptverfasser: Bi, Chang, Bai, Kailun, Li, Xing, Zhang, Xuekui
Format: Artikel
Sprache:eng
Schlagworte:
Online-Zugang:Volltext bestellen
Tags: Tag hinzufügen
Keine Tags, Fügen Sie den ersten Tag hinzu!
container_end_page
container_issue
container_start_page
container_title
container_volume
creator Bi, Chang
Bai, Kailun
Li, Xing
Zhang, Xuekui
description We introduce HiCat (Hybrid Cell Annotation using Transformative embeddings), a novel semi-supervised pipeline for annotating cell types from single-cell RNA sequencing data. HiCat fuses the strengths of supervised learning for known cell types with unsupervised learning to identify novel types. This hybrid approach incorporates both reference and query genomic data for feature engineering, enhancing the embedding learning process, increasing the effective sample size for unsupervised techniques, and improving the transferability of the supervised model trained on reference data when applied to query datasets. The pipeline follows six key steps: (1) removing batch effects using Harmony to generate a 50-dimensional principal component embedding; (2) applying UMAP for dimensionality reduction to two dimensions to capture crucial data patterns; (3) conducting unsupervised clustering of cells with DBSCAN, yielding a one-dimensional cluster membership vector; (4) merging the multi-resolution results of the previous steps into a 53-dimensional feature space that encompasses both reference and query data; (5) training a CatBoost model on the reference dataset to predict cell types in the query dataset; and (6) resolving inconsistencies between the supervised predictions and unsupervised cluster labels. When benchmarked on 10 publicly available genomic datasets, HiCat surpasses other methods, particularly in differentiating and identifying multiple new cell types. Its capacity to accurately classify novel cell types showcases its robustness and adaptability within intricate biological datasets.
doi_str_mv 10.48550/arxiv.2412.06805
format Article
fullrecord <record><control><sourceid>arxiv_GOX</sourceid><recordid>TN_cdi_arxiv_primary_2412_06805</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><sourcerecordid>2412_06805</sourcerecordid><originalsourceid>FETCH-arxiv_primary_2412_068053</originalsourceid><addsrcrecordid>eNpjYJA0NNAzsTA1NdBPLKrILNMzMjE00jMwszAw5WSw8sh0TiyxUnBUCE7NzdQNLi1ILSrLLE5NUXAsKCjKT0zOUEjLL1JwTs3JUQipLEhVcMzLyy9JLMnMz-NhYE1LzClO5YXS3Azybq4hzh66YFviC4oycxOLKuNBtsWDbTMmrAIABuAzfQ</addsrcrecordid><sourcetype>Open Access Repository</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype></control><display><type>article</type><title>HiCat: A Semi-Supervised Approach for Cell Type Annotation</title><source>arXiv.org</source><creator>Bi, Chang ; Bai, Kailun ; Li, Xing ; Zhang, Xuekui</creator><creatorcontrib>Bi, Chang ; Bai, Kailun ; Li, Xing ; Zhang, Xuekui</creatorcontrib><description>We introduce HiCat (Hybrid Cell Annotation using Transformative embeddings), a novel semi-supervised pipeline for annotating cell types from single-cell RNA sequencing data. HiCat fuses the strengths of supervised learning for known cell types with unsupervised learning to identify novel types. This hybrid approach incorporates both reference and query genomic data for feature engineering, enhancing the embedding learning process, increasing the effective sample size for unsupervised techniques, and improving the transferability of the supervised model trained on reference data when applied to query datasets. The pipeline follows six key steps: (1) removing batch effects using Harmony to generate a 50-dimensional principal component embedding; (2) applying UMAP for dimensionality reduction to two dimensions to capture crucial data patterns; (3) conducting unsupervised clustering of cells with DBSCAN, yielding a one-dimensional cluster membership vector; (4) merging the multi-resolution results of the previous steps into a 53-dimensional feature space that encompasses both reference and query data; (5) training a CatBoost model on the reference dataset to predict cell types in the query dataset; and (6) resolving inconsistencies between the supervised predictions and unsupervised cluster labels. When benchmarked on 10 publicly available genomic datasets, HiCat surpasses other methods, particularly in differentiating and identifying multiple new cell types. Its capacity to accurately classify novel cell types showcases its robustness and adaptability within intricate biological datasets.</description><identifier>DOI: 10.48550/arxiv.2412.06805</identifier><language>eng</language><subject>Computer Science - Learning ; Quantitative Biology - Biomolecules</subject><creationdate>2024-11</creationdate><rights>http://arxiv.org/licenses/nonexclusive-distrib/1.0</rights><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><link.rule.ids>228,230,776,881</link.rule.ids><linktorsrc>$$Uhttps://arxiv.org/abs/2412.06805$$EView_record_in_Cornell_University$$FView_record_in_$$GCornell_University$$Hfree_for_read</linktorsrc><backlink>$$Uhttps://doi.org/10.48550/arXiv.2412.06805$$DView paper in arXiv$$Hfree_for_read</backlink></links><search><creatorcontrib>Bi, Chang</creatorcontrib><creatorcontrib>Bai, Kailun</creatorcontrib><creatorcontrib>Li, Xing</creatorcontrib><creatorcontrib>Zhang, Xuekui</creatorcontrib><title>HiCat: A Semi-Supervised Approach for Cell Type Annotation</title><description>We introduce HiCat (Hybrid Cell Annotation using Transformative embeddings), a novel semi-supervised pipeline for annotating cell types from single-cell RNA sequencing data. HiCat fuses the strengths of supervised learning for known cell types with unsupervised learning to identify novel types. This hybrid approach incorporates both reference and query genomic data for feature engineering, enhancing the embedding learning process, increasing the effective sample size for unsupervised techniques, and improving the transferability of the supervised model trained on reference data when applied to query datasets. The pipeline follows six key steps: (1) removing batch effects using Harmony to generate a 50-dimensional principal component embedding; (2) applying UMAP for dimensionality reduction to two dimensions to capture crucial data patterns; (3) conducting unsupervised clustering of cells with DBSCAN, yielding a one-dimensional cluster membership vector; (4) merging the multi-resolution results of the previous steps into a 53-dimensional feature space that encompasses both reference and query data; (5) training a CatBoost model on the reference dataset to predict cell types in the query dataset; and (6) resolving inconsistencies between the supervised predictions and unsupervised cluster labels. When benchmarked on 10 publicly available genomic datasets, HiCat surpasses other methods, particularly in differentiating and identifying multiple new cell types. Its capacity to accurately classify novel cell types showcases its robustness and adaptability within intricate biological datasets.</description><subject>Computer Science - Learning</subject><subject>Quantitative Biology - Biomolecules</subject><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2024</creationdate><recordtype>article</recordtype><sourceid>GOX</sourceid><recordid>eNpjYJA0NNAzsTA1NdBPLKrILNMzMjE00jMwszAw5WSw8sh0TiyxUnBUCE7NzdQNLi1ILSrLLE5NUXAsKCjKT0zOUEjLL1JwTs3JUQipLEhVcMzLyy9JLMnMz-NhYE1LzClO5YXS3Azybq4hzh66YFviC4oycxOLKuNBtsWDbTMmrAIABuAzfQ</recordid><startdate>20241124</startdate><enddate>20241124</enddate><creator>Bi, Chang</creator><creator>Bai, Kailun</creator><creator>Li, Xing</creator><creator>Zhang, Xuekui</creator><scope>AKY</scope><scope>ALC</scope><scope>GOX</scope></search><sort><creationdate>20241124</creationdate><title>HiCat: A Semi-Supervised Approach for Cell Type Annotation</title><author>Bi, Chang ; Bai, Kailun ; Li, Xing ; Zhang, Xuekui</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-arxiv_primary_2412_068053</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2024</creationdate><topic>Computer Science - Learning</topic><topic>Quantitative Biology - Biomolecules</topic><toplevel>online_resources</toplevel><creatorcontrib>Bi, Chang</creatorcontrib><creatorcontrib>Bai, Kailun</creatorcontrib><creatorcontrib>Li, Xing</creatorcontrib><creatorcontrib>Zhang, Xuekui</creatorcontrib><collection>arXiv Computer Science</collection><collection>arXiv Quantitative Biology</collection><collection>arXiv.org</collection></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext_linktorsrc</fulltext></delivery><addata><au>Bi, Chang</au><au>Bai, Kailun</au><au>Li, Xing</au><au>Zhang, Xuekui</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>HiCat: A Semi-Supervised Approach for Cell Type Annotation</atitle><date>2024-11-24</date><risdate>2024</risdate><abstract>We introduce HiCat (Hybrid Cell Annotation using Transformative embeddings), a novel semi-supervised pipeline for annotating cell types from single-cell RNA sequencing data. HiCat fuses the strengths of supervised learning for known cell types with unsupervised learning to identify novel types. This hybrid approach incorporates both reference and query genomic data for feature engineering, enhancing the embedding learning process, increasing the effective sample size for unsupervised techniques, and improving the transferability of the supervised model trained on reference data when applied to query datasets. The pipeline follows six key steps: (1) removing batch effects using Harmony to generate a 50-dimensional principal component embedding; (2) applying UMAP for dimensionality reduction to two dimensions to capture crucial data patterns; (3) conducting unsupervised clustering of cells with DBSCAN, yielding a one-dimensional cluster membership vector; (4) merging the multi-resolution results of the previous steps into a 53-dimensional feature space that encompasses both reference and query data; (5) training a CatBoost model on the reference dataset to predict cell types in the query dataset; and (6) resolving inconsistencies between the supervised predictions and unsupervised cluster labels. When benchmarked on 10 publicly available genomic datasets, HiCat surpasses other methods, particularly in differentiating and identifying multiple new cell types. Its capacity to accurately classify novel cell types showcases its robustness and adaptability within intricate biological datasets.</abstract><doi>10.48550/arxiv.2412.06805</doi><oa>free_for_read</oa></addata></record>
fulltext fulltext_linktorsrc
identifier DOI: 10.48550/arxiv.2412.06805
ispartof
issn
language eng
recordid cdi_arxiv_primary_2412_06805
source arXiv.org
subjects Computer Science - Learning
Quantitative Biology - Biomolecules
title HiCat: A Semi-Supervised Approach for Cell Type Annotation
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-02-04T13%3A07%3A40IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-arxiv_GOX&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=HiCat:%20A%20Semi-Supervised%20Approach%20for%20Cell%20Type%20Annotation&rft.au=Bi,%20Chang&rft.date=2024-11-24&rft_id=info:doi/10.48550/arxiv.2412.06805&rft_dat=%3Carxiv_GOX%3E2412_06805%3C/arxiv_GOX%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_id=info:pmid/&rfr_iscdi=true