EESTI KEELE ÜHENDKORPUSTE SARI 2013–2021: MAHUKAIM EESTIKEELSETE DIGITEKSTIDE KOGU
ESTONIAN NATIONAL CORPUS 2013–2021: THE LARGEST COLLECTION OF ESTONIAN LANGUAGE DATA

Author(s): Kristina Koppel, Jelena Kallas
Subject(s): Morphology, Lexis, Computational linguistics, Finno-Ugrian studies, Present Times (2010 - today), Philology
Published by: Eesti Rakenduslingvistika Ühing (ERÜ)
Keywords: Estonian National Corpus; corpora; corpus lexicography; corpus query system; Estonian

Summary/Abstract: The paper describes the Estonian National Corpus 2021 (Estonian NC 2021), the latest and largest edition in the Estonian National Corpora series. The series consists of four corpora: Estonian NC 2013, 2017, 2019 and 2021. It was compiled by the Institute of the Estonian Language in cooperation with the software company Lexical Computing Ltd. All corpora are accessible through Sketch Engine, a corpus query system developed and maintained by Lexical Computing Ltd. The data are also stored in the Entu repository at the Center of Estonian Language Resources. The Estonian NC 2021 contains eleven sub-corpora (Web 2013, Web 2017, Web 2019, Web 2021, Feeds 2014–2021, Wikipedia 2021, Wikipedia Talk 2017, the Open Access Journals (DOAJ), Literature, the Balanced Corpus, and the Reference Corpus) totalling 2.4 billion words. In addition, the corpus is divided into genres and topics. The largest part of the Estonian NC 2021 is the Estonian Web Corpora, i.e. texts crawled from the web. In the paper, we outline how the web was crawled, how the crawled data were cleaned and post-processed, and the methodology used to classify web texts into genres and topics. We also introduce new tools for analysing corpus data in Sketch Engine and suggest further perspectives and needs for corpus development.

Journal: Eesti Rakenduslingvistika Ühingu aastaraamat

Issue Year: 2022
Issue No: 18
Page Range: 207-228
Page Count: 22
Language: Estonian
