Möistus sai kuulotedu: 19. sajandi vallakohtuprotokollide tekstidest digitaalse ressursi loomine
Creating a digital resource from 19th century communal court minute books
Author(s): Tõnis Türna, Siim Orasmaa, Kersti Lust, Liina Lindström, Gerth Jaanimäe, Kadri Muischnek, Maarja-Liisa Pilvik
Subject(s): Media studies, Morphology, 19th Century
Published by: Eesti Rakenduslingvistika Ühing (ERÜ)
Keywords: natural language processing; automatic morphology; digital humanities; corpus linguistics; databases; language history; Estonian
Summary/Abstract: This article describes an interdisciplinary attempt to create a digital resource from Estonian communal court minute books dating from 1866–1890, with a focus on using contemporary natural language processing tools to analyze archaic language. The database contains nearly 420 000 tokens in XML-tagged files. The texts are linguistically diverse: the parallel use of old and new spelling systems, dialects, and the background of the parish clerk give rise to considerable language variation. There are also differences in the orthographic choices made during the manual entry of the texts. For the purpose of linguistic analysis and tagging, automatic morphological analysis and named entity recognition were tested using EstNLTK libraries. A closer examination of the output suggested that both text normalization and tool adaptation are needed to improve the quality of the automatic analyses. This would result in tools that perform better on similar texts and that could therefore be applied to the automatic analysis of crowd-sourced material. Making the communal court minute books accessible and searchable by supplying linguistic and topical information creates a rich digital resource that is of interest to many disciplines.
Journal: Eesti Rakenduslingvistika Ühingu aastaraamat
- Issue Year: 2019
- Issue No: 15
- Page Range: 139-158
- Page Count: 20
- Language: Estonian