AlCo – NJË KORPUS TEKSTESH I GJUHËS SHQIPE ME NJËQIND MILIONË FJALË
AlCo – A ONE HUNDRED MILLION WORD CORPUS OF ALBANIAN

Author(s): Besim Kabashi
Subject(s): Language studies, Language and Literature Studies, Morphology, Syntax
Published by: Universiteti i Prishtinës, Fakulteti i Filologjisë
Keywords: Albanian Corpus (AlCo)

Summary/Abstract: Serious study of natural languages is impossible without consulting empirical data, and natural language corpora offer this data in its original form. Although a corpus does not always cover every possible word of a language, as, e.g., an evidence, spelling, or definition dictionary does, it makes it possible to explore the data in many different ways, e.g., based on the actual context of use. A large corpus offers more data than a small one and can therefore cover more words and more phenomena. Corpora allow language to be studied with an accuracy that is not possible without empirical data. To use these benefits, the corpora must first be created. We present an Albanian Corpus (AlCo) that contains one hundred million word tokens (text words), the first Albanian corpus of this size. The corpus covers different domains of language and contains different text types; it is a reference corpus. The work is still in progress: some texts still need to be replaced or recategorized. Since 2015, the corpus has been annotated with a morpho-syntactic tagset of 77 tags. We use CQPweb, a web-based corpus analysis system, to explore the corpus data.

  • Issue Year: 2017
  • Issue No: 36
  • Page Range: 123-132
  • Page Count: 10
  • Language: Albanian