Digitální studovna Ministerstva obrany ČR: Využití technologií na pokročilou indexaci obsahu historických dokumentů
The Digital Reading Room of the Ministry of Defense of the Czech Republic: Using Technology for Advanced Indexing of Historical Documents

Author(s): Marek Fišer, Tomáš Kykal
Subject(s): History, Museology & Heritage Studies, Classification, Library operations and management, Electronic information storage and retrieval, Recent History (1900 till today), Source Material
Published by: AV ČR - Akademie věd České republiky - Ústav pro soudobé dějiny
Keywords: Czech Republic; Digital Reading Room of the Ministry of Defence of the Czech Republic; archives; libraries; Kramerius Digital Library; information technologies; semantic search; AI

Summary/Abstract: Developments in information technology and artificial intelligence are providing tools with considerable potential to facilitate and enrich research in history and related disciplines. A prerequisite for their effective use, however, is the highest-quality conversion possible of analogue historical sources into machine-readable form, so that searching, classifying and extracting the information they contain is as efficient as in born-digital sources. In their study, Kykal and Fišer first provide an overview of the development of digital libraries and of how digitization results have been made available in the Czech Republic, taking into account the different strategies and technological backgrounds of libraries and archives. They reflect on the limitations of full-text search and point out a surprising systemic deficit in current digital libraries: the absence of diagnostics for the quality of machine transcription produced by Optical Character Recognition (OCR) programs. They then pay special attention to the parameters and possibilities of the Digital Reading Room of the Ministry of Defence of the Czech Republic (Digitální studovna Ministerstva obrany ČR, DSMO), which is based on the Kramerius Digital Library system. Because it aggregates the digitization output of the memory institutions of the Ministry of Defence, the Reading Room makes available library documents as well as digitized items from archive and museum collections.
Using the example of a printed periodical of the Austro-Hungarian Army from the First World War, the authors present the subsequent enhancement of OCR results with the PERO tool (from the Czech pokročilá extrakce a rozpoznávání obsahu, Advanced Extraction and Recognition of Content), including enrichment with a metadata schema that captures the layout of graphic and text objects (Analyzed Layout and Text Object, ALTO) and allows the searched text to be located precisely on the digitized image. With this program, the textual content of printed and typewritten texts, and also of handwritten texts, can be retrieved much more efficiently and with noticeably higher quality. Moreover, the data in the ALTO schema could be used to monitor the quality of OCR results automatically. This procedure would significantly increase the usability of semantic search, machine translation, summarization and many other artificial intelligence tools that are yet to be fully deployed in the Czech Digital Library environment.
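
The two uses of ALTO mentioned in the abstract, locating a searched word on the page image and monitoring OCR quality, can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes that each recognized word is stored as an ALTO String element with CONTENT, HPOS, VPOS, WIDTH and HEIGHT attributes and an optional WC (word confidence) attribute; whether a given PERO or DSMO export populates WC is an assumption here. The sample XML, its words and the 0.5 threshold are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Invented two-word ALTO fragment; real files are far larger.
SAMPLE_ALTO = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v4#">
  <Layout>
    <Page ID="p1" WIDTH="2000" HEIGHT="3000">
      <PrintSpace>
        <TextBlock ID="b1">
          <TextLine ID="l1">
            <String CONTENT="Beispiel" HPOS="100" VPOS="200" WIDTH="180" HEIGHT="40" WC="0.98"/>
            <String CONTENT="Zeitung" HPOS="300" VPOS="200" WIDTH="220" HEIGHT="40" WC="0.42"/>
          </TextLine>
        </TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>"""


def read_words(alto_xml):
    """Return one dict per ALTO String element, ignoring the namespace version."""
    words = []
    for elem in ET.fromstring(alto_xml).iter():
        if elem.tag.rsplit("}", 1)[-1] != "String":
            continue
        wc = elem.get("WC")  # WC is optional in ALTO, so it may be missing
        words.append({
            "text": elem.get("CONTENT", ""),
            "box": tuple(float(elem.get(k, 0)) for k in ("HPOS", "VPOS", "WIDTH", "HEIGHT")),
            "confidence": float(wc) if wc is not None else None,
        })
    return words


def find_word(words, query):
    """Return the bounding boxes (x, y, width, height) of case-insensitive exact matches."""
    return [w["box"] for w in words if w["text"].lower() == query.lower()]


def mean_confidence(words):
    """Average WC over the words that carry one; None if no word does."""
    scores = [w["confidence"] for w in words if w["confidence"] is not None]
    return sum(scores) / len(scores) if scores else None


def low_confidence(words, threshold=0.5):
    """List the words whose WC falls below an (arbitrary) threshold."""
    return [w["text"] for w in words
            if w["confidence"] is not None and w["confidence"] < threshold]


if __name__ == "__main__":
    words = read_words(SAMPLE_ALTO)
    print(find_word(words, "zeitung"))
    print(mean_confidence(words))
    print(low_confidence(words))
```

A page-level average like this is only a rough indicator: recognizer confidence is not the same as measured transcription accuracy, so a real quality check would need calibration against manually corrected samples.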

  • Issue Year: XXXI/2024
  • Issue No: 2
  • Page Range: 447-467
  • Page Count: 21
  • Language: Czech