Automatisierte Identifikation und Lemmatisierung historischer Berufsbezeichnungen in deutschsprachigen Datenbeständen

Bibliographic details
Published in: Zeitschrift für digitale Geisteswissenschaften 2022-03, Vol.7
Main authors: Jan Michael Goldberg, Katrin Moeller
Format: Article
Language: German
Description
Abstract: Occupational information occurs in many historical sources. For many research areas, not only standardization but above all classification of this information is a central prerequisite for analysis. In this article, the assignment of spelling variants to already defined generic occupation names is referred to as lemmatization or normalization, while the assignment of the normalized spelling to a classification system is referred to as classification. To reduce manual effort, an algorithm for the automated lemmatization of historical, German-language occupational data is developed. The best result is achieved with a supervised machine learning approach. Overall, about 72 percent of the occupational data can be lemmatized, and about 98 percent of these assignments are correct.
ISSN: 2510-1358
DOI: 10.17175/2022_002_v2