Eestikeelsete veebitekstide automaatne liigitamine
Classifying Estonian Web texts
Author(s): Kristiina Vaik, Kadri Muischnek
Subject(s): Media studies, Classification, Computational linguistics, Finno-Ugrian studies, Philology
Published by: Eesti Rakenduslingvistika Ühing (ERÜ)
Keywords: corpus linguistics; automatic classification; natural language processing; machine learning; genre; corpus; Estonian;
Summary/Abstract: Due to the size of the Internet and the multitude of traditional and new genres, there has been increasing interest in automatic genre classification. Labelling texts is essential in natural language processing because it allows us to select more appropriate language models for the analysis. The aim of the article is to describe and present the results of automatically classifying Estonian Web 2013 texts. We evaluated the quality of different classification models on our training set and a manually labelled test set. Most research on automatic classification has focused on classifying multiple genres, whereas our objective was binary classification: we set out to classify Estonian Web 2013 texts according to whether they are canonical or not. For training we used the Balanced Corpus to represent canonical language and the New Media Corpus to represent non-canonical language. As no binary-labelled subcorpus of Estonian Web 2013 texts was available, we compiled one ourselves by manual labelling. For classification we used several supervised machine learning algorithms, with a simple bag-of-words representation as features. The results of the preliminary experiments show that neural networks outperformed the other machine learning algorithms, achieving an accuracy of over 0.7. The overall results of this study indicate that in order to increase the accuracy of the classifiers, new features should be added (e.g. POS counts, sentences per paragraph, words per sentence, uppercase and lowercase letters per sentence, etc.). Our best model, the neural network classifier, achieved an accuracy of 0.99 on the training set but only slightly over 0.74 on the test set. This suggests that future work requires a larger and more appropriate training set. The manual labelling task showed us that the transition from canonical to non-canonical text is gradual rather than clear-cut. The current models produce a score between 0 and 1 indicating whether an item belongs to a class; the classifiers should therefore expose these scores as probabilities so that the final predictions can be tuned by selecting a decision threshold.
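As an illustration of the approach outlined in the abstract, the following Python sketch shows a bag-of-words binary classifier (canonical vs. non-canonical) with a tunable decision threshold. This is a minimal sketch, not the authors' actual pipeline; the example sentences, labels, and the particular scikit-learn classifier are hypothetical placeholders.

```python
# Minimal sketch: bag-of-words features + a small neural network classifier,
# with predicted probabilities thresholded for the final binary decision.
# All texts and labels below are hypothetical placeholders.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score

# Hypothetical training data: 1 = canonical, 0 = non-canonical
train_texts = [
    "Valitsus kiitis heaks uue seaduseelnõu.",      # canonical (news-like)
    "Teadlased avaldasid uurimuse tulemused.",      # canonical
    "lol see oli nii naljakas :D tule ka!!",        # non-canonical (chat-like)
    "mis sa arvad, kas homme saab kokku vms",       # non-canonical
]
train_labels = [1, 1, 0, 0]

test_texts = ["Uuring näitab, et tulemused paranesid.", "ok cool, näeme siis :)"]
test_labels = [1, 0]

# Simple bag-of-words representation
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_texts)
X_test = vectorizer.transform(test_texts)

# A small feed-forward neural network classifier
clf = MLPClassifier(hidden_layer_sizes=(16,), max_iter=500, random_state=0)
clf.fit(X_train, train_labels)

# Predicted probabilities allow tuning the decision threshold
probs = clf.predict_proba(X_test)[:, 1]
threshold = 0.5  # could be tuned on a development set
preds = (probs >= threshold).astype(int)

print("Accuracy:", accuracy_score(test_labels, preds))
```

Working with the predicted probabilities, rather than hard class labels, is what makes it possible to shift the threshold when the boundary between canonical and non-canonical texts is gradual.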
Journal: Eesti Rakenduslingvistika Ühingu aastaraamat
- Issue Year: 2018
- Issue No: 14
- Page Range: 215-227
- Page Count: 13
- Language: Estonian