The number of clusters in hybrid predictive models: does it really matter?
Author(s): Mariusz Łapczyński, Bartłomiej Jefmański
Subject(s): Economy
Published by: Główny Urząd Statystyczny
Keywords: hybrid predictive model; k-means algorithm; decision trees
Summary/Abstract: Researchers have long attempted to combine various analytical tools to build predictive models. It is possible to combine tools of the same type (ensemble models, committees) or tools of different types (hybrid models). Hybrid models are used in such areas as customer relationship management (CRM), web usage mining, medical sciences, petroleum geology and anomaly detection in computer networks. Our hybrid model was created as a sequential combination of cluster analysis and decision trees. In the first step of the procedure, objects were grouped into clusters using the k-means algorithm. The second step involved building a decision tree model with a new independent variable that indicated which cluster each object belonged to. The analysis was based on 14 data sets collected from publicly accessible repositories. The performance of the models was assessed using measures derived from the confusion matrix, including accuracy, precision, recall and F-measure, as well as the lift in the first and second deciles. We sought a relationship between the number of clusters and the quality of hybrid predictive models. To our knowledge, no similar studies have been conducted so far. Our research demonstrates that in some cases building hybrid models can improve the performance of predictive models. The models with the highest performance measures required a relatively large number of clusters (from 9 to 15).
Journal: Przegląd Statystyczny. Statistical Review
- Issue Year: 66/2019
- Issue No: 3
- Page Range: 228-238
- Page Count: 11
- Language: English
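
The two-step procedure described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`k_means`, `add_cluster_feature`), the random initialisation, the iteration limit and the toy data are all assumptions made for illustration. The sketch covers the first step (k-means clustering) and the construction of the cluster-membership variable that the second step would pass to a decision tree learner.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two numeric vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(point, centroids):
    """Index of the centroid closest to `point` (first index wins ties)."""
    return min(range(len(centroids)),
               key=lambda j: squared_distance(point, centroids[j]))


def k_means(rows, k, n_iter=100, seed=0):
    """Plain Lloyd's algorithm; returns the final list of k centroids.

    Initial centroids are k distinct rows drawn at random (an assumption;
    the abstract does not say how the clustering was initialised).
    """
    rng = random.Random(seed)
    centroids = [list(r) for r in rng.sample(rows, k)]
    for _ in range(n_iter):
        labels = [nearest(r, centroids) for r in rows]
        new_centroids = []
        for j in range(k):
            members = [r for r, lab in zip(rows, labels) if lab == j]
            if members:
                # New centroid = column-wise mean of the cluster's members.
                new_centroids.append(
                    [sum(col) / len(members) for col in zip(*members)])
            else:
                # Keep the old centroid if a cluster ends up empty.
                new_centroids.append(centroids[j])
        if new_centroids == centroids:  # assignments have stabilised
            break
        centroids = new_centroids
    return centroids


def add_cluster_feature(rows, centroids):
    """Append each row's cluster index as a new independent variable."""
    return [list(r) + [nearest(r, centroids)] for r in rows]


# Toy data (hypothetical): two well-separated groups of three objects.
rows = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
        [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]]
centroids = k_means(rows, k=2)
augmented = add_cluster_feature(rows, centroids)
for row in augmented:
    print(row)  # original features followed by the cluster index
```

The augmented rows would then serve as input to a decision tree learner. The cluster index is a nominal label, so the tree should treat it as a categorical variable rather than as an ordered number. Repeating the sketch for different values of `k` mirrors the question posed in the title.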