Sõnaliigituse kitsaskohad eesti keele arvutianalüüsis
The problems of word class disambiguation in the automatic analysis of Estonian

Author(s): Kadri Vider, Kadri Muischnek
Subject(s): Language and Literature Studies
Published by: Eesti Rakenduslingvistika Ühing (ERÜ)
Keywords: corpus linguistics; text corpora; word classes; morphological disambiguation; word sense disambiguation; Estonian language; word class disambiguation; the automatic analysis of Estonian; Estonian linguistics

Summary/Abstract: This article presents a new language resource, a morphologically annotated text corpus, and discusses some linguistic problems that arose during manual morphological disambiguation. The research group of computational linguistics of the University of Tartu has developed a morphologically disambiguated corpus of Estonian. This work was supported by the national program “Eesti keel ja rahvuskultuur” (“Estonian Language and Culture”). During the project, 500 000 running words were manually morphologically disambiguated. The disambiguated texts belong to the following text classes: newspaper texts (100 000 words), fiction (100 000 words), legal texts (100 000 words), texts from the scientific magazine “Horisont”, and 100 000 words of transcribed spoken language texts. The disambiguated written-language texts are available at the home page of the research group of computational linguistics. In this article we describe the tagset and the process of manual disambiguation, but the focus is on the linguistic problems that the annotators encountered during that process. Although in some rare cases a human annotator had difficulties, e.g. in determining the case form of a noun, the main difficulties were encountered in word class disambiguation. The borderlines between word classes are known to be fuzzy; in Estonian even the closed word classes are not really closed, as new pre- and postpositions (as well as adverbs) are constantly developing from inflectional forms of nouns and verbs. This phenomenon is of extreme interest to a linguist but most depressing for a human morphological annotator. The morphologically disambiguated texts are used as input for word sense disambiguation (in addition to many other applications). Word sense disambiguation deals only with content words, so the exact border between the inflectional forms of nouns and verbs on the one hand and adpositions on the other is important for this further task.

  • Issue Year: 2005
  • Issue No: 1
  • Page Range: 99-114
  • Page Count: 16
  • Language: Estonian