Generation of bilingual lexicons from a parallel corpus

Kakskeelsete leksikonide genereerimine paralleelkorpuse baasil
Generation of bilingual lexicons from a parallel corpus

Author(s): Kaarel Veskis
Subject(s): Language and Literature Studies
Published by: Eesti Rakenduslingvistika Ühing (ERÜ)
Keywords: corpus linguistics; automatically created dictionaries; bilingual lexicography; language processing; technical dictionaries; Estonian; English

Summary/Abstract: Besides contrastive studies of languages or language variants, parallel/comparable corpora have many other uses in both theory and practice, and the potential of some of these uses is still awaiting discovery. One of the most interesting trends is the compilation or revision of dictionaries by extracting translation equivalents. The article surveys what has been done so far, with a view to possible practical applications to Estonian. For example, a simple lexicographic tool has been outlined that lets the lexicographer generate a list of translation equivalents from a parallel text. There have also been reports of attempts to generate source material for an English-Estonian and Estonian-English technical dictionary using not only the parallel corpus but also free software that needs few additional language resources beyond the corpus material. For the time being, multilingual technical dictionaries can be compiled from parallel corpora only semi-automatically: without the intervention of a human proofreader, the method yields only raw material to help lexicographers, terminologists or translation systems. No software exists that could perform alignment and dictionary generation on the basis of Estonian grammatical structure and its possible points of equivalence with the structure of another language. Although the quality of the generated lexicon would certainly be improved by preliminary morphological analysis of the parallel corpus, the present focus is on language-independent approaches to dictionary extraction. The word aligners UWA and LWA, developed by Swedish researchers within the Plug project (Tiedemann 2002), use relatively little language-specific information, which makes them easy to apply to the automatic generation of dictionaries containing Estonian material.
The article describes an attempt to develop source material for a technical dictionary by means of UWA and LWA, drawing on the English-Estonian parallel corpus of the University of Tartu and the English-Estonian subsection of the JRC-Acquis multilingual parallel corpus. One of the dictionaries generated in this way contains 130 865 headwords and 482 571 word forms. In a random sample of 50 entries, the precision turned out to be 60%. In addition, the article surveys the working principles of the programmes used and offers some suggestions on how to improve the UWA results, with a view to an analogous tool that might be developed for Estonian.
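
To make the idea of extracting translation equivalents from a parallel text more concrete, the following is a minimal sketch of a language-independent co-occurrence baseline. It is not the method used by UWA or LWA, whose internals are not described in this abstract. It simply scores word pairs by the Dice coefficient over sentence-aligned text. The function names `dice_lexicon` and `precision`, the threshold `min_score`, and the toy sentence pairs are all assumptions made for illustration. The last line only restates the reported evaluation arithmetic: 60% of a 50-entry sample implies 30 entries judged correct.

```python
from collections import Counter
from itertools import product


def dice_lexicon(pairs, min_score=0.6):
    """Return {source_word: (target_word, score)} for sentence-aligned pairs.

    Hypothetical helper: scores each word pair by the Dice coefficient
    2 * joint / (source_count + target_count), counted per sentence pair.
    """
    src_freq, tgt_freq, joint = Counter(), Counter(), Counter()
    for src, tgt in pairs:
        s_words = set(src.lower().split())
        t_words = set(tgt.lower().split())
        src_freq.update(s_words)
        tgt_freq.update(t_words)
        joint.update(product(s_words, t_words))

    lexicon = {}
    for (s, t), n in joint.items():
        score = 2 * n / (src_freq[s] + tgt_freq[t])
        # Keep only the best-scoring candidate above the threshold.
        if score >= min_score and (s not in lexicon or score > lexicon[s][1]):
            lexicon[s] = (t, score)
    return lexicon


def precision(n_correct, n_checked):
    """Share of manually checked entries that were judged correct."""
    return n_correct / n_checked


# Toy English-Estonian sentence pairs, invented for this sketch.
toy_pairs = [
    ("the court", "kohus"),
    ("the law", "seadus"),
    ("court law", "kohus seadus"),
]
toy_lexicon = dice_lexicon(toy_pairs)

# 60% precision on a sample of 50 entries corresponds to 30 correct entries.
sample_precision = precision(30, 50)
```

A raw list produced this way would still need the human proofreading the abstract calls for: frequent function words and inflected word forms are exactly where a purely statistical, language-independent score tends to go wrong.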

  • Issue Year: 2007
  • Issue No: 3
  • Page Range: 355-372
  • Page Count: 18
  • Language: Estonian