Using Handwritten Text Recognition on bilingual Evenki-Russian manuscripts of Konstantin Rychkov
Author(s): Alexandre Arkhipov, Anna Barinskaya, Roman Shtefura
Subject(s): Language and Literature Studies, Theoretical Linguistics, Applied Linguistics, Computational linguistics, Philology
Published by: Институт за литература - БАН
Keywords: Transkribus; PyLaia; Russian; Evenki; bilingual manuscript
Summary/Abstract: We report on applying Handwritten Text Recognition (HTR) to manuscripts from the archive of Konstantin Rychkov preserved at IOM RAS, St. Petersburg, within the INEL project. The manuscripts contain folklore texts in Evenki (Tungusic) collected in Western Siberia in the 1910s. We used services provided by the Transkribus platform. The necessary step of Layout Analysis proved to be time-consuming because the parallel Evenki-Russian text is arranged on the page without a strict separation line. HTR models were trained successively on increasing amounts of data, up to 521 pages. The best Character Error Rate (CER) attained on validation data for the largest dataset is 4.50% for models trained on all characters. The distribution of errors is non-uniform: most errors are due to just a few problematic issues, especially diacritics such as the accent marking stress. The stress mark is written high above the line and is frequently cut off from the line images at the preprocessing stage. After excluding the stress mark from training data and recognition, the lowest CER dropped to 2.90%. We compared two recognition engines, HTR+ and PyLaia. The HTR+ model trained without stress marks made fewer errors in letters, while PyLaia performed better with respect to diacritics.
Journal: Scripta & e-Scripta
- Issue Year: 2021
- Issue No: 21
- Page Range: 233-244
- Page Count: 12
- Language: English
- Content File: PDF
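
Note on the metric: the abstract reports Character Error Rate with and without the stress mark. The following is a minimal sketch of how such a comparison is commonly computed: edit distance between a reference transcription and the recognized text, divided by the reference length. It is not the authors' evaluation code and not Transkribus code. The function names, the toy strings, and the assumption that the stress mark is encoded as the combining acute accent U+0301 are all illustrative.

```python
STRESS_MARK = "\u0301"  # assumption: stress is encoded as a combining acute accent


def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate: edit distance divided by reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return levenshtein(reference, hypothesis) / len(reference)


def strip_stress(text: str) -> str:
    """Remove the (assumed) stress mark before scoring."""
    return text.replace(STRESS_MARK, "")


# Toy strings, not taken from the manuscripts: the recognizer drops the stress mark.
reference = "вода\u0301"
hypothesis = "вода"

cer_all = cer(reference, hypothesis)
cer_no_stress = cer(strip_stress(reference), strip_stress(hypothesis))
```

In this toy case the only error is the missing stress mark, so the rate falls to zero once the mark is excluded from both strings. This mirrors, in miniature, why excluding a frequently misrecognized diacritic lowers the reported CER.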