Multi-Class Text Classification on Khmer News Using Ensemble Method in Machine Learning Algorithms

Author(s): Raksmey Phann, Chitsutha Soomlek, Pusadee Seresangtakul
Subject(s): ICT Information and Communications Technologies
Published by: Vysoká škola ekonomická v Praze
Keywords: Text classification; Khmer news; Machine learning; Feature extraction; Optimal hyperparameters; News categorization; Ensemble learning method

Summary/Abstract: This research applies text classification to categorize Khmer news articles. News articles were collected from three online websites through web scraping and grouped into nine categories. After text preprocessing, the dataset was split into training and testing sets. We then evaluated the performance of the ensemble learning method with machine learning classifiers using k-fold validation. The classifiers employed were logistic regression, Complement Naive Bayes, Bernoulli Naive Bayes, k-nearest neighbours, perceptron, support vector machines, stochastic gradient descent, AdaBoost, decision tree, and random forest. To improve accuracy in categorizing Khmer news articles, Grid Search CV was used to find the optimal hyperparameters for each classifier with TF-IDF and Delta TF-IDF feature extraction. The highest accuracy was achieved by the ensemble learning method with the support vector machine at its optimal hyperparameters (C = 10, kernel = rbf): 83.47% with TF-IDF and 83.40% with Delta TF-IDF. The model shows that Khmer news articles can be accurately categorized.
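
The hyperparameter search described in the abstract can be illustrated with a minimal sketch, assuming scikit-learn is the toolkit (the abstract names Grid Search CV but not the library). It covers only TF-IDF feature extraction, an SVM, and a grid search with k-fold validation; the ensemble step and Delta TF-IDF are left out because the abstract does not say how they were built. The toy corpus, category names, parameter grid, and fold count are all hypothetical, and the placeholder English tokens stand in for Khmer text that has already been segmented into words.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Hypothetical toy corpus: space-separated placeholder tokens stand in for
# word-segmented Khmer news text. Three made-up categories, four documents each.
docs = [
    "match goal team win", "team player goal score",
    "coach match player league", "league win score team",
    "market bank price trade", "bank loan price rate",
    "trade export market rate", "loan export bank market",
    "phone app software chip", "software update app device",
    "chip device phone network", "network app update phone",
]
labels = ["sport"] * 4 + ["economy"] * 4 + ["technology"] * 4

# TF-IDF features feed directly into a support vector classifier.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("svm", SVC()),
])

# Assumed search space; the abstract reports C = 10 and kernel = rbf as optimal
# on the real dataset, but does not list the full grid that was searched.
param_grid = {"svm__C": [0.1, 1, 10], "svm__kernel": ["linear", "rbf"]}

# Stratified k-fold keeps every category present in each training split.
cv = StratifiedKFold(n_splits=4, shuffle=True, random_state=0)
search = GridSearchCV(pipeline, param_grid, cv=cv, scoring="accuracy")
search.fit(docs, labels)

print("best hyperparameters:", search.best_params_)
print("mean cross-validated accuracy:", round(search.best_score_, 4))

# The refitted best pipeline can classify unseen (toy) text.
predicted = search.predict(["goal score match"])
print("predicted category:", predicted[0])
```

On a corpus this small the selected hyperparameters carry no meaning; the sketch only shows how the vectorizer and classifier are tuned together so that each fold's TF-IDF vocabulary is learned from its training split alone.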

  • Issue Year: 12/2023
  • Issue No: 2
  • Page Range: 243-259
  • Page Count: 17
  • Language: English