Skanowane teksty jako korpusy
Scanned Texts as Corpora
Author(s): Janusz S. Bień
Subject(s): Language and Literature Studies
Published by: Wydział Polonistyki Uniwersytetu Warszawskiego
Keywords: skanowanie; tekst; korpus; wyszukiwarka; kodowanie; scanning; text; corpus; search tool; coding
Summary/Abstract: A modification of the Poliqarp corpus search tool is described, which is oriented towards searching scanned texts with dirty OCR (i.e. fully automatic Optical Character Recognition without any proofreading). The search tool has been in operation since December 2009 and is available at http://wbl.klf.uw.edu.pl/. The two-level regular expressions that can be used in queries make it possible, at least in principle, to work around OCR errors. The crucial property of the search engine is its ability to highlight hits on the original scans stored in the DjVu format. Although the feature is not original, having been used first for the Century Dictionary and later for Jamieson’s Etymological Dictionary of the Scottish Language, it is substantially augmented by support for so-called graphical concordances and by a convenient way to bookmark hits. The system now handles four dictionaries with a total size of over 40,000 pages. It is expected that other texts will be added to the system in the near future.
Journal: Prace Filologiczne
- Issue Year: 2012
- Issue No: 63
- Page Range: 25-36
- Page Count: 12
- Language: Polish