The IMPACT Project Polish Ground-Truth Texts as a DjVu Corpus

Author(s): Janusz S. Bień
Subject(s): Language and Literature Studies, Library and Information Science, Electronic information storage and retrieval, Theoretical Linguistics
Published by: Instytut Slawistyki Polskiej Akademii Nauk
Keywords: Polish language; corpora; DjVu; OCR; PAGE; Page Analysis and Ground-Truth Elements; GNU GPL

Summary/Abstract: The paper has two aims. The first is to describe the already implemented idea of DjVu corpora, i.e. corpora that consist of both scanned images and a transcription of the texts, with the words linked to their occurrences in the scans. The second is to present a case study of a corpus of almost 5 000 pages of Polish historical texts dating from 1570 to 1756 (it is practically the very first corpus of historical Polish). The tools described are general-purpose and freely available under the GNU GPL license, so they can also be used for other purposes.

  • Issue Year: 2014
  • Issue No: 14
  • Page Range: 75-84
  • Page Count: 10
  • Language: English