Language independent algorithm for clustering text documents with respect to their sentiment
Author(s): Jerzy Korzeniewski, Adam Idczak
Subject(s): Economy, Socio-Economic Research
Published by: Główny Urząd Statystyczny
Keywords: text mining; document sentiment; document clustering
Summary/Abstract: Determining the sentiment of a written text is an important task in text research. The task can be performed in either a supervised or an unsupervised setting. In this paper, we propose a novel unsupervised algorithm for documents written in any language, using documents written in Polish as an example. The clustering of Polish-language texts with respect to their sentiment is poorly developed in the literature on the subject. The novelty of the proposed algorithm lies in abandoning stoplists and lemmatisation. Instead, we propose translating all documents into English and performing a two-stage document grouping. In the first stage, selected documents are assigned to the class of positive or negative documents based on a set of lexical and grammatical rules as well as a set of key-terms. The key-terms do not have to be supplied by the user; the algorithm finds them itself. In the second stage, the remaining documents are attached to one of the two classes according to rules based on the vocabulary found in the documents grouped in the first stage. The algorithm was tested on three corpora of documents and achieved very good results.
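The two-stage grouping outlined in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: it assumes the documents have already been translated into English, it uses a small hard-coded seed list (`POSITIVE_SEEDS`, `NEGATIVE_SEEDS`) as a stand-in for the key-terms that the paper's algorithm finds on its own, and a single negation rule stands in for the paper's lexical and grammatical rules. All function and variable names are hypothetical.

```python
import re
from collections import Counter

# Assumption: hand-picked seeds. The paper's algorithm discovers its key-terms itself.
POSITIVE_SEEDS = {"good", "great", "excellent", "love"}
NEGATIVE_SEEDS = {"bad", "poor", "terrible", "hate"}
NEGATIONS = {"not", "no", "never"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens (no stoplist, no lemmatisation)."""
    return re.findall(r"[a-z']+", text.lower())


def rule_score(tokens):
    """Sum seed-term polarities; a negation directly before a seed flips its sign."""
    score = 0
    for i, tok in enumerate(tokens):
        polarity = (tok in POSITIVE_SEEDS) - (tok in NEGATIVE_SEEDS)
        if polarity and i > 0 and tokens[i - 1] in NEGATIONS:
            polarity = -polarity
        score += polarity
    return score


def cluster_by_sentiment(docs):
    """Return one label ("pos" or "neg") per document, in input order."""
    tokenized = [tokenize(d) for d in docs]
    labels = [None] * len(docs)

    # Stage 1: label only the documents on which the rules give a clear verdict.
    for i, toks in enumerate(tokenized):
        score = rule_score(toks)
        if score > 0:
            labels[i] = "pos"
        elif score < 0:
            labels[i] = "neg"

    # Collect, per class, how many stage-1 documents contain each word.
    pos_vocab, neg_vocab = Counter(), Counter()
    for toks, label in zip(tokenized, labels):
        if label == "pos":
            pos_vocab.update(set(toks))
        elif label == "neg":
            neg_vocab.update(set(toks))

    # Stage 2: attach each remaining document to the class whose stage-1
    # vocabulary it overlaps more strongly (ties go to "pos" in this sketch).
    for i, toks in enumerate(tokenized):
        if labels[i] is None:
            balance = sum(pos_vocab[t] - neg_vocab[t] for t in set(toks))
            labels[i] = "pos" if balance >= 0 else "neg"
    return labels


docs = [
    "The hotel was excellent and the staff were friendly",
    "Terrible service, the room was dirty",
    "Friendly staff everywhere",
    "Dirty room",
    "This was not good",
]
print(cluster_by_sentiment(docs))
```

In this toy run the third and fourth documents contain no seed term, so they are left unlabelled in the first stage and are attached in the second stage through the words they share with the documents already grouped.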
Journal: Statistics in Transition. New Series
- Issue Year: 25/2024
- Issue No: 3
- Page Range: 175-185
- Page Count: 11
- Language: English