Lietuvių–latvių ir latvių–lietuvių kalbų lygiagretusis tekstynas LILA
Lithuanian-Latvian, Latvian-Lithuanian Parallel Corpus (LILA)
Author(s): Erika Rimkutė, Kristine Levane-Petrova, Andrius Utka
Subject(s): Language and Literature Studies
Published by: Kauno Technologijos Universitetas
Keywords: parallel corpus; Lithuanian language; Latvian language; Baltic languages; under-resourced languages
Summary/Abstract: The paper presents a new linguistic resource, LILA, a Lithuanian-Latvian and Latvian-Lithuanian parallel corpus aligned at the paragraph and sentence levels. The corpus totals 9 million words and is so far a unique resource for this language pair. Its metadata record bibliographical information (title, author, year of publication, etc.), and its structural annotation marks the boundaries of aligned segments, paragraphs, and sentences. Paragraphs and sentences were aligned with the semi-automatic alignment tool Aligner 2.0.6.7. The corpus was compiled in 2011-2012 by researchers at the Centre of Computational Linguistics of Vytautas Magnus University (VMU CCL) and the Laboratory of Artificial Intelligence of the Institute of Mathematics and Computer Science, University of Latvia (LU MII). The paper describes the problems and challenges that arise when a parallel corpus is created for two small languages. The limited choice of suitable parallel material is the greatest obstacle, because it makes a corpus of the desired size difficult to compile. The paper presents the conception and structure of the LILA corpus, the phases of its compilation, the alignment tool, the query system, and examples of usage. The corpus is especially useful for language teaching and learning, for comparing languages, for compiling dictionaries, and for developing language technology tools (e.g. statistical machine translation systems).
Journal: Kalbų Studijos
- Issue Year: 2013
- Issue No: 23
- Page Range: 70-77
- Page Count: 8
- Language: Lithuanian