FARKLI VERİ SETLERİ ÜZERİNDE SMO VE J48 ALGORİTMALARININ SINIFLANDIRMA SONUÇLARININ KARŞILAŞTIRILMASI
COMPARISON OF CLASSIFICATION RESULTS OF SMO AND J48 ALGORITHMS ON DIFFERENT DATA SETS

Author(s): Mehmet Ali Alan, Cavit Yeşilyurt
Subject(s): Social Sciences
Published by: Sakarya Üniversitesi
Keywords: Data Mining; Classification; SMO; J48

Summary/Abstract: Institutional data sources, social media posts, website articles and forms generate large amounts of data. Processing such volumes in traditional ways, and turning them into information that can be used in decision processes, is very difficult. In this context, data mining offers advanced techniques for producing the needed information from the available data.
Databases are rich in hidden information that can support rational decision-making. Classification and prediction are two important data analysis techniques, used to describe important data classes and to predict future data trends. These analyses help in understanding large amounts of data. Today, institutions produce large amounts of data but have difficulty extracting meaningful and useful information from it. Large data sets are not easy to analyze with traditional statistical methods, so special methods are required to process and analyze them. Data mining methods emerged to meet this requirement.
The aim of this study is to compare the performance of the SMO and J48 algorithms, both of which are used for classification in data mining. For this purpose, data mining was performed on three different student data sets.
Data mining is an analysis method that uncovers hidden relationships and summarizes data in novel ways that are both useful and understandable. It is one step in the process of knowledge discovery in databases, which explores scientific and technical data to reveal previously unknown patterns. Classification is a process that is frequently used in daily life. In classification, objects are separated so that each one is assigned to one of a set of mutually exclusive and exhaustive categories, called classes. Many practical decision-making processes can be formulated as classification problems. For example, people or objects can be assigned to one of several categories.
Classification is the process of assigning elements to different classes. These classes may be defined by business rules, class boundaries, or mathematical functions. A classification can be built on the relationship between the properties of an element and a known class value; this type of classification is called “supervised learning”. If there are no known examples of a class, the classification is unsupervised. The most common unsupervised classification approach is clustering, whose most common applications are retail basket analysis and fraud detection.
In data mining, supervised learning means learning a classification function, or constructing a classification model, from data whose classes are already known. This function or model maps records in the database to the target attribute, so it can be used to predict the class of new data. Depending on the types of data to be mined or on the specific data mining application, a data mining system draws on areas such as spatial data analysis, information retrieval, pattern recognition, image analysis, signal processing, computer graphics, web technology, economics, business, bioinformatics or psychology.
SMO (Sequential Minimal Optimization) is a simple algorithm that can quickly solve the quadratic programming (QP) problem that arises in training a support vector machine (SVM), without any extra matrix storage and without numerical QP optimization steps. At every step, SMO chooses to solve the smallest possible optimization problem. For the standard SVM QP problem, the smallest possible problem involves two Lagrange multipliers, because the multipliers must satisfy a linear equality constraint. At each step, SMO selects two Lagrange multipliers to optimize jointly, finds their optimal values, and updates the SVM to reflect the new values.
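The two-multiplier step described above can be sketched in a few lines. This is a minimal illustration of the standard SMO pair update, not the code used in the study or in WEKA. The function name `smo_pair_update` and all parameter names are assumptions made for this example: `a1` and `a2` are the current multipliers, `y1` and `y2` the labels (+1 or -1), `e1` and `e2` the prediction errors (model output minus label), `k11`, `k12` and `k22` the kernel values, and `c` the box constraint.

```python
def smo_pair_update(a1, a2, y1, y2, e1, e2, k11, k12, k22, c):
    """Jointly optimize two Lagrange multipliers in closed form.

    Illustrative sketch only: the heuristics for choosing the pair and the
    update of the SVM threshold are omitted.
    """
    # The linear equality constraint keeps y1*a1 + y2*a2 fixed, so a2 can
    # only move along a line segment inside the box [0, c] x [0, c].
    if y1 != y2:
        low, high = max(0.0, a2 - a1), min(c, c + a2 - a1)
    else:
        low, high = max(0.0, a1 + a2 - c), min(c, a1 + a2)

    # Curvature of the objective along that segment.
    eta = k11 + k22 - 2.0 * k12
    if eta <= 0.0 or low >= high:
        # Simplification for this sketch: skip degenerate pairs.
        return a1, a2

    # Unconstrained optimum for a2, then clip it to the feasible segment.
    a2_new = a2 + y2 * (e1 - e2) / eta
    a2_new = min(high, max(low, a2_new))

    # Move a1 by the matching amount so the equality constraint still holds.
    a1_new = a1 + y1 * y2 * (a2 - a2_new)
    return a1_new, a2_new
```

Because each pair is solved with a handful of arithmetic operations, no numerical QP solver and no stored kernel matrix are needed for the step itself.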
The advantage of SMO lies in the fact that the two Lagrange multipliers can be solved for analytically, so numerical QP optimization is avoided entirely. Although more optimization sub-problems are solved over the course of the algorithm, each sub-problem is so fast that the overall QP problem is solved quickly. Furthermore, SMO does not require any additional matrix storage, so very large SVM training problems can fit into the memory of an ordinary personal computer or workstation. Because it uses no matrix algorithms, SMO is also less susceptible to numerical precision problems.
J48 is a decision tree algorithm based on the very popular C4.5 algorithm developed by J. Ross Quinlan. Decision trees are a classic way of representing the information learned by a machine learning algorithm and provide a powerful and fast way to express structure in data. The algorithm classifies the data recursively. This maximizes accuracy on the training data, but it can produce excessive rules that describe only the particular characteristics of that data. Based on information gain theory, J48 can automatically select the relevant attributes. It is an iterative algorithm that splits the samples at the point where the information gain is highest. Tree construction begins by selecting the best variable for the root of the tree and proceeds from top to bottom. J48 can also perform effective pruning to cut weak branches that are not meaningful. One reason for pruning is that the purpose of a decision tree is not to explore the data, but to build a simple classification model of it.
In this study, three different data sets of university students were used. The data were cleaned and arranged using Excel macros, and data warehouses were prepared. After the necessary conversions, the data were written to the text files “iibf1.arff”, “iibf2.arff” and “myo.arff”.
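The information gain criterion mentioned in the description of J48 can be illustrated with a short sketch. It assumes nominal attributes only and uses made-up toy rows, not the study's data. The names `entropy`, `information_gain`, `best_attribute` and `toy_rows` are assumptions for this example. C4.5 itself normalizes the gain by the split information (the gain ratio) and also handles numeric attributes and pruning, none of which is shown here.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, attribute, target):
    """Reduction in class entropy obtained by splitting rows on one attribute."""
    base = entropy([row[target] for row in rows])
    groups = {}
    for row in rows:
        groups.setdefault(row[attribute], []).append(row[target])
    # Weighted average entropy of the subsets created by the split.
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return base - remainder


def best_attribute(rows, attributes, target):
    """Pick the attribute with the highest information gain (the tree root)."""
    return max(attributes, key=lambda a: information_gain(rows, a, target))


# Hypothetical toy rows, invented for illustration only.
toy_rows = [
    {"income": "high", "gender": "female", "level": "high"},
    {"income": "high", "gender": "male", "level": "high"},
    {"income": "low", "gender": "female", "level": "low"},
    {"income": "low", "gender": "male", "level": "low"},
]
```

In the toy rows, `income` separates the two classes perfectly while `gender` carries no information about them, so `best_attribute(toy_rows, ["income", "gender"], "level")` selects `income` as the root.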
In the study, WEKA (Waikato Environment for Knowledge Analysis) version 3.7.2, developed at the University of Waikato, was used. For each data set, the student's gender, province, family income level, number of siblings, number of siblings in education, and entrance score were taken as attributes. The level of the entrance score was used to define the classes.
According to the results, the SMO algorithm achieved a higher classification success rate than the J48 algorithm, making it the more reliable of the two.
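To show roughly what an ARFF file with the attributes listed above might look like, the sketch below writes a small ARFF text from Python. It is an assumption-laden illustration: the helper `to_arff`, the relation name, the attribute names, the nominal value sets and both data rows are invented for this example and are not taken from “iibf1.arff”, “iibf2.arff” or “myo.arff”.

```python
def to_arff(relation, attributes, rows):
    """Render rows as ARFF text.

    attributes is a list of (name, spec) pairs, where spec is either the
    string "numeric" or a list of allowed nominal values.
    """
    lines = [f"@relation {relation}", ""]
    for name, spec in attributes:
        kind = spec if isinstance(spec, str) else "{" + ",".join(spec) + "}"
        lines.append(f"@attribute {name} {kind}")
    lines += ["", "@data"]
    for row in rows:
        lines.append(",".join(str(value) for value in row))
    return "\n".join(lines) + "\n"


# Hypothetical schema modelled on the attributes named in the abstract.
demo_attributes = [
    ("gender", ["female", "male"]),
    ("province", ["province_a", "province_b"]),
    ("family_income", ["low", "medium", "high"]),
    ("siblings", "numeric"),
    ("siblings_studying", "numeric"),
    ("entrance_level", ["low", "high"]),  # class attribute, placed last
]

# Made-up rows for illustration only.
demo_rows = [
    ("female", "province_a", "medium", 2, 1, "high"),
    ("male", "province_b", "low", 3, 1, "low"),
]

demo_text = to_arff("students_demo", demo_attributes, demo_rows)
```

Placing the class attribute last follows a common WEKA convention, since the last attribute is typically chosen as the class when none is set explicitly.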

  • Issue Year: 6/2018
  • Issue No: 3
  • Page Range: 199-213
  • Page Count: 15
  • Language: Turkish