Language independent algorithm for clustering text documents with respect to their sentiment

Author(s): Jerzy Korzeniewski, Adam Idczak
Subject(s): Economy, Socio-Economic Research
Published by: Główny Urząd Statystyczny
Keywords: text mining; document sentiment; document clustering

Summary/Abstract: Determining the sentiment of a written text is an important task in text research. The task can be performed in either a supervised or an unsupervised manner. In this paper, we propose a novel unsupervised algorithm for documents written in any language, using documents written in Polish as an example. Sentiment clustering of Polish-language texts is poorly developed in the literature on the subject. The novelty of the proposed algorithm lies in abandoning stoplists and lemmatisation. Instead, we propose translating all documents into English and performing a two-stage document grouping. In the first stage, selected documents are assigned to the positive or the negative class on the basis of a set of lexical and grammatical rules and a set of key-terms. The key-terms do not have to be entered by the user; the algorithm finds them itself. In the second stage, the remaining documents are attached to one of the classes according to rules based on the vocabulary found in the documents grouped in the first stage. The algorithm was tested on three corpora of documents and achieved very good results.
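
The two-stage grouping described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the seed key-terms are hypothetical and hard-coded (the paper's algorithm finds its key-terms itself), the lexical and grammatical rules are reduced to a simple one-sided seed match, the documents are assumed to be already translated into English, and the names tokenize, stage_one and stage_two are invented for illustration.

```python
import re
from collections import Counter

# Hypothetical seed key-terms (assumption): the paper's algorithm
# discovers its key-terms itself rather than taking them as input.
POSITIVE_SEEDS = {"excellent", "great", "recommend"}
NEGATIVE_SEEDS = {"terrible", "awful", "disappointing"}


def tokenize(text):
    """Lowercase and split into word tokens; no stoplist, no lemmatisation."""
    return re.findall(r"[a-z']+", text.lower())


def stage_one(docs):
    """Label only the documents whose seed-term evidence is one-sided."""
    labels = {}
    for i, doc in enumerate(docs):
        tokens = set(tokenize(doc))
        pos = len(tokens & POSITIVE_SEEDS)
        neg = len(tokens & NEGATIVE_SEEDS)
        if pos > 0 and neg == 0:
            labels[i] = "positive"
        elif neg > 0 and pos == 0:
            labels[i] = "negative"
    return labels


def stage_two(docs, labels):
    """Attach the remaining documents using stage-one class vocabularies."""
    vocab = {"positive": Counter(), "negative": Counter()}
    for i, label in labels.items():
        vocab[label].update(tokenize(docs[i]))
    result = dict(labels)
    for i, doc in enumerate(docs):
        if i in result:
            continue
        tokens = tokenize(doc)
        pos_score = sum(vocab["positive"][t] for t in tokens)
        neg_score = sum(vocab["negative"][t] for t in tokens)
        # Ties go to "positive" here; this is an arbitrary choice of the sketch.
        result[i] = "positive" if pos_score >= neg_score else "negative"
    return result


docs = [
    "Excellent phone, the battery lasts for days.",
    "Terrible service, the delivery was late.",
    "The battery lasts and lasts.",
    "The delivery was late again.",
]
labels = stage_two(docs, stage_one(docs))
```

In this toy run, the first two documents are labelled in the first stage by their seed terms, and the last two are attached in the second stage because they share vocabulary ("battery", "delivery") with the documents already grouped.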

Journal: Statistics in Transition. New Series

Issue Year: 25/2024
Issue No: 3
Page Range: 175-185
Page Count: 11
Language: English
