Lost in translation: the pitfalls of Ensembl gene annotations between human genome assemblies and their impact on diagnostics
Main authors: | , , |
---|---|
Format: | Dataset |
Language: | eng |
Subjects: | |
Summary: | Gene models based on the GRCh37 human genome assembly are preferred by many international projects over the more recent assemblies (GRCh38 and T2T). Discrepant genes (DGs), those recognized as protein-coding in the new but not the old assembly, are ignored by several genomic resources and discarded by variant prioritization tools that rely on GRCh37-based information. We curated a set of Ensembl genes with discrepant annotations between GRCh37 and GRCh38 and matched them to their RefSeq transcripts. We also examined their clinical and phenotypic relevance. A total of 337 genes were reclassified as ‘protein-coding’ in GRCh38 but not in GRCh37, 194 of which have a discrepant HGNC gene symbol. Many (N = 73) remain missing from the currently known RefSeq gene models. We found many clinically relevant genes in this group of neglected genes, and we anticipate that many more will prove relevant in the future. Important additional annotations, such as evolutionary constraint metrics, are also not calculated for these genes, further relegating them to oblivion. For discrepant genes, the inaccurate label of ‘non-protein-coding’ has relevant ramifications for clinical genetics. Accurate collation of these genes allows for manual curation in clinically relevant scenarios. |
DOI: | 10.6084/m9.figshare.23709768 |
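
The comparison described in the summary, finding genes labelled protein-coding in GRCh38 but not in GRCh37, can be illustrated with a minimal Python sketch. This is not the authors' pipeline. It assumes that each assembly's annotation has already been reduced to a mapping from a stable gene ID to a biotype string, and that only genes present in both mappings are compared. The function name `find_discrepant_genes` and the placeholder IDs are hypothetical; the biotype strings follow Ensembl's naming.

```python
def find_discrepant_genes(biotypes_grch37, biotypes_grch38):
    """Return sorted gene IDs that are 'protein_coding' in GRCh38
    but carry a different biotype in GRCh37.

    Both arguments are dicts mapping a gene ID to a biotype string.
    Assumption: genes absent from GRCh37 are skipped, not reported.
    """
    discrepant = []
    for gene_id, new_biotype in biotypes_grch38.items():
        old_biotype = biotypes_grch37.get(gene_id)
        if old_biotype is None:
            continue  # no GRCh37 record to compare against
        if new_biotype == "protein_coding" and old_biotype != "protein_coding":
            discrepant.append(gene_id)
    return sorted(discrepant)


# Placeholder data for illustration only; these are not real gene IDs.
grch37 = {
    "GENE_A": "lincRNA",
    "GENE_B": "protein_coding",
    "GENE_C": "processed_pseudogene",
}
grch38 = {
    "GENE_A": "protein_coding",
    "GENE_B": "protein_coding",
    "GENE_C": "protein_coding",
    "GENE_D": "protein_coding",  # absent from GRCh37, so skipped
}

discrepant_genes = find_discrepant_genes(grch37, grch38)
print(discrepant_genes)
```

In practice the two mappings would be built from the respective Ensembl annotation files, and the resulting list would still need the manual curation the summary calls for.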