HiCat: A Semi-Supervised Approach for Cell Type Annotation

We introduce HiCat (Hybrid Cell Annotation using Transformative embeddings), a novel semi-supervised pipeline for annotating cell types from single-cell RNA sequencing data. HiCat fuses the strengths of supervised learning for known cell types with unsupervised learning to identify novel types. This...

Detailed description


Bibliographic details
Main authors:	Bi, Chang; Bai, Kailun; Li, Xing; Zhang, Xuekui
Format:	Article
Language:	eng
Subjects:	Computer Science - Learning; Quantitative Biology - Biomolecules
Online access:	Order full text

creator	Bi, Chang ; Bai, Kailun ; Li, Xing ; Zhang, Xuekui
description	We introduce HiCat (Hybrid Cell Annotation using Transformative embeddings), a novel semi-supervised pipeline for annotating cell types from single-cell RNA sequencing data. HiCat fuses the strengths of supervised learning for known cell types with unsupervised learning to identify novel types. This hybrid approach incorporates both reference and query genomic data for feature engineering, enhancing the embedding learning process, increasing the effective sample size for unsupervised techniques, and improving the transferability of the supervised model trained on reference data when applied to query datasets. The pipeline follows six key steps: (1) removing batch effects using Harmony to generate a 50-dimensional principal component embedding; (2) applying UMAP for dimensionality reduction to two dimensions to capture crucial data patterns; (3) conducting unsupervised clustering of cells with DBSCAN, yielding a one-dimensional cluster membership vector; (4) merging the multi-resolution results of the previous steps into a 53-dimensional feature space that encompasses both reference and query data; (5) training a CatBoost model on the reference dataset to predict cell types in the query dataset; and (6) resolving inconsistencies between the supervised predictions and unsupervised cluster labels. When benchmarked on 10 publicly available genomic datasets, HiCat surpasses other methods, particularly in differentiating and identifying multiple new cell types. Its capacity to accurately classify novel cell types showcases its robustness and adaptability within intricate biological datasets.
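The feature-merging arithmetic in step (4) and the reconciliation idea in step (6) can be illustrated with a minimal sketch. It uses only the Python standard library and random placeholder numbers in place of real Harmony, UMAP, DBSCAN, and CatBoost outputs. The function names `build_features` and `reconcile`, the `min_conf` threshold, and the `novel_<cluster>` naming are hypothetical. The reconciliation rule shown here is an assumption made for illustration; the abstract does not specify how HiCat resolves disagreements.

```python
import random


def build_features(pcs, umap_xy, cluster_ids):
    """Concatenate, per cell, 50 principal components, 2 UMAP coordinates,
    and 1 cluster id into a single 53-dimensional feature row."""
    rows = []
    for pc, xy, cid in zip(pcs, umap_xy, cluster_ids):
        if len(pc) != 50 or len(xy) != 2:
            raise ValueError("expected 50 PCs and 2 UMAP coordinates per cell")
        rows.append(list(pc) + list(xy) + [float(cid)])
    return rows


def reconcile(pred_labels, pred_conf, cluster_ids, min_conf=0.5):
    """Hypothetical reconciliation rule (NOT taken from the paper):
    if the mean classifier confidence inside a cluster is below min_conf,
    relabel the whole cluster as a candidate novel type; otherwise keep
    the supervised prediction for each cell."""
    members = {}
    for i, cid in enumerate(cluster_ids):
        members.setdefault(cid, []).append(i)

    final = list(pred_labels)
    for cid, idx in members.items():
        if cid == -1:
            # -1 marks noise points in scikit-learn's DBSCAN convention;
            # this sketch leaves their supervised labels untouched.
            continue
        mean_conf = sum(pred_conf[i] for i in idx) / len(idx)
        if mean_conf < min_conf:
            for i in idx:
                final[i] = f"novel_{cid}"
    return final


# Placeholder data standing in for real pipeline outputs (six cells).
random.seed(0)
n_cells = 6
pcs = [[random.gauss(0, 1) for _ in range(50)] for _ in range(n_cells)]
umap_xy = [[random.gauss(0, 1) for _ in range(2)] for _ in range(n_cells)]
cluster_ids = [0, 0, 0, 1, 1, -1]

features = build_features(pcs, umap_xy, cluster_ids)

# In the described pipeline a classifier trained on reference rows would
# supply these; here they are made-up values for illustration only.
pred_labels = ["T cell", "T cell", "B cell", "B cell", "B cell", "T cell"]
pred_conf = [0.9, 0.8, 0.7, 0.3, 0.2, 0.4]

final_labels = reconcile(pred_labels, pred_conf, cluster_ids)
```

With these made-up inputs, cluster 0 keeps its supervised labels because its mean confidence is high, while cluster 1 is flagged as a candidate novel type. A real implementation would likely treat the cluster id as a categorical feature when training the classifier, which this sketch does not attempt.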
doi_str_mv	10.48550/arxiv.2412.06805
format	Article
creationdate	2024-11-24
rights	http://arxiv.org/licenses/nonexclusive-distrib/1.0
oa	free_for_read
collection	arXiv Computer Science; arXiv Quantitative Biology; arXiv.org
linktorsrc	https://arxiv.org/abs/2412.06805
backlink	https://doi.org/10.48550/arXiv.2412.06805
fulltext	fulltext_linktorsrc
identifier	DOI: 10.48550/arxiv.2412.06805
language	eng
recordid	cdi_arxiv_primary_2412_06805
source	arXiv.org
subjects	Computer Science - Learning; Quantitative Biology - Biomolecules
title	HiCat: A Semi-Supervised Approach for Cell Type Annotation
url	https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-02-04T13%3A07%3A40IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-arxiv_GOX&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=HiCat:%20A%20Semi-Supervised%20Approach%20for%20Cell%20Type%20Annotation&rft.au=Bi,%20Chang&rft.date=2024-11-24&rft_id=info:doi/10.48550/arxiv.2412.06805&rft_dat=%3Carxiv_GOX%3E2412_06805%3C/arxiv_GOX%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_id=info:pmid/&rfr_iscdi=true