
Corpus of Ukrainian Dialect Texts (CorUDiT) as a Component of a Corpus of Texts of the Ukrainian Language (CTUL)

Author(s): Olena Siruk
Subject(s): Language and Literature Studies
Published by: Wydział Polonistyki Uniwersytetu Warszawskiego
Keywords: Ukrainian dialects; dialectology; annotation; transcription; lemmatization; disambiguation

Summary/Abstract: The paper discusses the stages of development of the Corpus of Ukrainian Dialect Texts (CorUDiT) within the project to compile the Corpus of Texts of the Ukrainian Language (CTUL). The first part presents a comparative overview of how statistical, computational and corpus tools and techniques have been applied in Polish, Russian and German dialectology. The subsequent part defines the corpus guidelines, transcription types and text annotation principles. The texts are recorded in three versions: in phonetic transcription, in orthographic notation, and in a version approximating the literary language, which may later form the basis for automatic markup of the text and automatic morphological analysis.

  • Issue Year: 2012
  • Issue No: 63
  • Page Range: 257-270
  • Page Count: 14
  • Language: English