The Future of Natural History Transcription: Navigating AI advancements with VoucherVision and the Specimen Label Transcription Project (SLTP)

Natural history collections are critical reservoirs of biodiversity information but collections staff are constantly grappling with substantial backlogs and limited resources. The task of transcribing specimen label text into searchable databases requires a significant amount of time, manual labor,...

Ausführliche Beschreibung

Gespeichert in:
Bibliographische Detailangaben
Veröffentlicht in:Biodiversity Information Science and Standards 2023-09, Vol.7 (8)
Hauptverfasser: Weaver, William, Lough, Kyle, Smith, Stephen, Ruhfel, Brad
Format: Artikel
Sprache:eng
Schlagworte:
Online-Zugang:Volltext
Tags: Tag hinzufügen
Keine Tags, Fügen Sie den ersten Tag hinzu!
Beschreibung
Zusammenfassung:Natural history collections are critical reservoirs of biodiversity information but collections staff are constantly grappling with substantial backlogs and limited resources. The task of transcribing specimen label text into searchable databases requires a significant amount of time, manual labor, and funding. To address this challenge, we introduce VoucherVision, a tool harnessing the capabilities of several Large Language Models (LLMs; Naveed et al. 2023) to augment specimen label transcription. The VoucherVision tool automates laborious components of the transcription process, leveraging an Optical Character Recognition (OCR) system and LLMs to convert unstructured label text into appropriate data formats compatible with database ingestion. VoucherVision uses a combination of structured output parsers and recursive re-prompting strategies to ensure consistency and quality of the LLM-formatted text, significantly reducing errors. Integration of VoucherVision with the University of Michigan Herbarium’s transcription workflow resulted in a significant reduction in per-image transcription time, suggesting significant potential advantages for collections workflows. VoucherVision offers promising strides towards efficient digitization, with curatorial staff playing critical roles in data quality assurance and process oversight. Emphasizing the importance of knowledge sharing, the University of Michigan Herbarium is backing the Specimen Label Transcription Project (SLTP), which will provide open access to benchmarking datasets, fine-tuned models, and validation tools to rank the performance of different methodologies, LLMs, and prompting strategies. In the rapidly evolving landscape of Artificial Intelligence (AI) development, we recognize the profound potential of diverse contributions and innovative methodologies to redefine and advance the transformation of curatorial practices, catalyzing an era of accelerated digitization in natural history collections. An early, public version of VoucherVision is available to try here: https://vouchervision.azurewebsites.net/
ISSN:2535-0897
2535-0897
DOI:10.3897/biss.7.113067