Administrative documents of the Don Cossack Host in the 18th – 19th centuries: the issue of the creation of a linguistic corpus
Author(s): Oksana Anatolevna Gorban, Marina Kosova, Elena Mihajlovna Sheptukhina, Andrey Svetlov, Anatoly Komendantov, Alexander Matveev, Daniil Filimonov
Subject(s): Language and Literature Studies, Theoretical Linguistics, Applied Linguistics, Computational linguistics, Philology
Published by: Институт за литература - БАН
Keywords: diachronic linguistic corpus; administrative documents; Don Cossack Host; meta-tagging; morphological tags
Summary/Abstract: The article presents the basic principles for designing a diachronic linguistic corpus of documents from the Don Cossack Host offices held in the State Archive of the Volgograd Region, Russia. These principles cover collecting documents for the text corpus, setting up the technical base for automatic processing and text editing, and planning automated tagging, morphological annotation, and corpus software tools. The authors explain some technical aspects of corpus processing and of the composition of the text corpus. They consider it reasonable to add any document to the corpus, including draft texts with crossed-out fragments, because this ensures an accurate record of the grammar and vocabulary of the language at a given historical period. A set of language marker types has been developed for automated meta-tagging. The corpus software tools are defined so as to enable accurate annotation of obsolete fonts, so that they can be processed together with regular language units and expressions during morphological and genre meta-tagging; where a text is only partially adapted, the authentic old graphic symbols may have to be preserved.
Journal: Scripta & e-Scripta
- Issue Year: 2021
- Issue No: 21
- Page Range: 139-150
- Page Count: 12
- Language: English
- Content File-PDF