New Slovene Corpora within the Communication In Slovene Project
Author(s): Nataša Logar-Berginc, Simon Krek
Subject(s): Language and Literature Studies
Published by: Wydział Polonistyki Uniwersytetu Warszawskiego
Keywords: Slovene language; reference corpus; corpus of spoken language; annotation; Creative Commons licence
Summary/Abstract: The paper presents three publicly available corpora of contemporary Slovene: a) Gigafida, a monolingual dynamic corpus of written language (1 billion words); b) KRES, a balanced subcorpus of written language (100 million words); c) GOS, a reference corpus of spoken Slovene (1 million words). The spoken and written data have been collected since 2008, and the billion-word corpus has already been compiled. It is lemmatized and morpho-syntactically tagged, and partly syntactically annotated. A wide range of language features can be retrieved from it, including syntactic and semantic information as well as phraseology. The corpus also serves as the basis for a lexical database and a modern corpus-based grammar, both of which are being developed within the project. The larger corpus is the foundation of the balanced subcorpus of written language. The paper compares the main features of the two written corpora, describes the taxonomy, and specifies the number and types of texts they include. It also covers the million-word spoken language corpus compiled within the project, discussing the demographic and text genre criteria for text collection (pedagogical discourse, public, formal, monologue, dialogue) as well as the rules of transcription. Finally, it lists the tools for linguistic annotation developed within the project and made available to the public.
Journal: Prace Filologiczne
- Issue Year: 2012
- Issue No: 63
- Page Range: 197-208
- Page Count: 12
- Language: English