Aspects of NCBI GenBank as a Biodiversity Information Resource

Bibliographic details
Published in: Biodiversity Information Science and Standards 2024-09, Vol.8 (e129438), p.213
First author: Nakazato, Takeru
Format: Article
Language: English
Description
Summary: DNA sequencing of museum specimens, also known as museomics, provides new insights into the study of biodiversity, including taxonomy, phylogeny, and environmental studies. Sequencing of specimens has also led to the rediscovery of extinct species (Suzuki et al. 2016), identification of related species (Waku et al. 2016), and analysis of ancient DNA (Kanzawa-Kiriyama et al. 2016). Nucleotide sequence data have been collected for more than 30 years under the framework of the International Nucleotide Sequence Database Collaboration (INSDC) by three institutes: the National Center for Biotechnology Information, US (NCBI), the European Bioinformatics Institute (EBI), and the DNA Data Bank of Japan (DDBJ) (Arita et al. 2020). NCBI maintains a database of sequence data, GenBank, which contained approximately 494 million sequences as of April 2022 (Sayers et al. 2021). GenBank provides qualifiers for describing various types of biodiversity information, such as "/specimen_voucher", "/lat_lon" (latitude and longitude), and "/collection_date". In addition, INSDC now requires that all submissions include the sampling location and date (INSDC 2023). I surveyed the biodiversity information assigned to GenBank records to determine the potential of GenBank as a biodiversity resource. I downloaded all GenBank data as of August 2023 from the FTP site. The "/specimen_voucher" qualifier was introduced in Release 104 in December 1997 to describe the specimen ID. It was originally designed to take a free-text value: for example, /specimen_voucher="Smith s. n. 4-IV-1995 (U. S. Natl. Herbarium)". In Release 162 in October 2007, a structured value of the form "[<institution-code>:[<collection-code>:]]<specimen_id>" was added (institution-code and collection-code are optional). A "/specimen_voucher" qualifier is present in 527,215 records (37.8%) for fish, 3,096,112 records (40.3%) for insects, and 1,505,556 records (39.0%) for flowering plants. However, fewer than 10% of records list their specimen IDs using this structured description. To make use of these ambiguous specimen IDs in GenBank, the IDs may need to be cleansed, using databases such as NCBI BioCollections and GRSciColl (Global Registry of Scientific Collections) or AI, so that they can be mapped to IDs in databases rich in specimen information, such as those of the Global Biodiversity Information Facility (GBIF) and the Barcode of Life Data System (BOLD). In GenBank, the BOLD ID is listed in the /db_xref qualifier in the “Features” field as the
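The distinction between free-text and structured "/specimen_voucher" values described in the summary can be illustrated with a minimal Python sketch. This is not the author's survey code: the names `Voucher`, `parse_specimen_voucher`, and `structured_fraction` are hypothetical, and the regular expression assumes that institution and collection codes contain no whitespace or colons. A match only means that a value is shaped like the structured form; it does not confirm that the codes are registered in NCBI BioCollections or GRSciColl.

```python
import re
from typing import Iterable, NamedTuple, Optional


class Voucher(NamedTuple):
    """Parsed /specimen_voucher value (hypothetical helper type)."""
    institution_code: Optional[str]
    collection_code: Optional[str]
    specimen_id: str
    structured: bool


# Assumption: codes are runs of characters without whitespace or colons,
# followed directly by ':'; the specimen ID is whatever remains.
_STRUCTURED = re.compile(
    r"^(?P<inst>[^\s:]+):(?:(?P<coll>[^\s:]+):)?(?P<spec>\S.*)$"
)


def parse_specimen_voucher(value: str) -> Voucher:
    """Split a qualifier value into its parts, or keep it as free text."""
    value = value.strip()
    match = _STRUCTURED.match(value)
    if match:
        return Voucher(
            match.group("inst"),
            match.group("coll"),  # None when the collection code is omitted
            match.group("spec").strip(),
            True,
        )
    # Free text: the whole value is kept as the specimen ID.
    return Voucher(None, None, value, False)


def structured_fraction(values: Iterable[str]) -> float:
    """Share of values that look like the structured form."""
    items = list(values)
    if not items:
        return 0.0
    return sum(parse_specimen_voucher(v).structured for v in items) / len(items)
```

Under these assumptions, a made-up value such as "USNM:ENT:12345" is split into institution code, collection code, and specimen ID, whereas the free-text example quoted in the summary is returned unchanged and flagged as unstructured. Applying `structured_fraction` to the voucher values of one taxonomic group gives a rough share of structured IDs.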
ISSN: 2535-0897
DOI: 10.3897/biss.8.137771