Digitizing a Large Corpus of Handwritten Documents Using Crowdsourcing and Cultural Consensus Theory
Published in: | International journal of internet science 2016-01, Vol.11 (1), p.8 |
---|---|
Main authors: | , , , , |
Format: | Article |
Language: | eng |
Subjects: | |
Online access: | Full text |
Summary: | We investigated using internet-based procedures to convert information from a large handwritten archive of ethnographic survey data into a computer-addressable database. Rather than manually transcribing the archive's estimated 23,000 pages of handwritten data, we sought to develop novel crowdsourcing task designs and to use an innovative variation of Cultural Consensus Analysis (CCT) to objectively aggregate crowdsourced responses based on a formal process model of shared knowledge. Experiment 1 used simulated internet-based tasks conducted with human subject pool participants in a university laboratory. Experiment 2 used a similar design, except that it was implemented on an internet-based research platform (Amazon Mechanical Turk). Results from these investigations shed light on several uncertainties concerning the utility of CCT analyses with crowdsourced transcription data. For example, they clarify (1) whether crowdsourced tasks are practical as a method for automating the transcription of the archive's handwritten material, (2) whether responses from perceptually based tasks inherent to transcribing handwritten documents can be analyzed using CCT, and (3) whether, if CCT is appropriate as a model of the transcription challenge, the results produce accurate answer-key estimates that could serve as correct transcriptions of the archive's data. Our results address these issues and convey how CCT modeling can be modified and made appropriate for aggregating such data. Implications of these analyses and uses of CCT in large-scale crowdsourced data collection platforms are discussed. |
---|---|
ISSN: | 1662-5544 |