PolUKR (A Polish-Ukrainian Parallel Corpus) as a Testbed for a Parallel Corpora Toolbox
Author(s): Natalia Kotsyba
Subject(s): Language and Literature Studies
Published by: Wydział Polonistyki Uniwersytetu Warszawskiego
Keywords: korpus równoległy; język polski; język ukraiński; narzędzia korpusowe; parallel corpus; Polish; Ukrainian; corpus tools
Summary/Abstract: The paper examines the PolUKR project (2005–2010), whose aim was to compile an experimental Polish-Ukrainian parallel corpus. The language processing tools developed within the project may prove useful to linguists who are interested in creating and employing parallel or problem-tailored monolingual corpora but lack a sufficient technical background for developing their own software. The article covers the history of the project and describes the structure of the Polish-Ukrainian parallel corpus. It also briefly presents general guidelines (“a roadmap”) for developing parallel corpora, which may be applied by linguists not involved in the project, and discusses the problem of adjusting the corpus to the existing tools and software. Finally, the author presents software developed within the project, in particular the UGTag tagger for Ukrainian and a sentence splitter, both of which employ Polish and Ukrainian abbreviations, and thus are language-specific. The PLUczeK editor, however, may be applied to any pair of languages.
Journal: Prace Filologiczne
- Issue Year: 2012
- Issue No: 63
- Page Range: 181-196
- Page Count: 16
- Language: English