Līdzsvarotais mūsdienu latviešu valodas tekstu korpuss, tā nozīme gramatikas pētījumos
The Balanced Corpus of Modern Latvian, its role in grammar studies

Author(s): Kristine Levane-Petrova
Subject(s): Morphology, Historical Linguistics, Computational linguistics, Baltic Languages
Published by: Latvijas Universitātes Akadēmiskais apgāds
Keywords: corpus; general corpus; balance; representativeness; metadata; text selection criteria

Summary/Abstract: This paper presents "The Balanced Corpus of Modern Latvian" (LVK) (www.korpuss.lv), a new 10-million-word representative corpus of contemporary Latvian. It describes the design, composition and text selection criteria of LVK2018, as well as the annotation of the corpus (metadata and morphological tagging) and its usage. The history of the LVK series goes back to 2007, when the first 1-million-word corpus was created. The design, compilation and text selection criteria of LVK were based on the Latvian Language Corpus Conception, and the same design criteria were used for the subsequent corpora in the series. The last corpus of that series, LVK2013, was released in 2013 with 4.5 million words. All corpora in the series are morphologically annotated, and their texts are annotated with metadata. LVK2018 was enlarged from LVK2013 using slightly modified versions of the design criteria applied to the earlier corpora in the series. It is designed as a general-language, representative and balanced corpus that aims to cover the variety of existing texts in estimated proportions. The corpus contains five sections: journalism (60%), fiction (20%), scientific texts (10%), legal texts (8%) and parliamentary transcripts (2%). This work has received financial support from the European Regional Development Fund under grant agreement No. 1.1.1.1/16/A/219 (Full Stack of Language Resources for Natural Language Understanding and Generation in Latvian).

  • Issue Year: 2019
  • Issue No: 10
  • Page Range: 131-146
  • Page Count: 16
  • Language: Latvian