Clustering Mixed-Type Data: A Benchmark Study on KAMILA and K-Prototypes

Publication Date

1-1-2021

Document Type

Conference Proceeding

Publication Title

Studies in Classification, Data Analysis, and Knowledge Organization

Volume

5

DOI

10.1007/978-3-030-60104-1_10

First Page

83

Last Page

91

Abstract

Benchmarking in cluster analysis identifies which clustering techniques give the best results for different types of data structures and sets a standard for evaluating newer clustering methods. There are many benchmarking studies in cluster analysis for continuous data, but only a few for mixed-type data, i.e., data sets with both nominal and continuous variables. We therefore explore the process of benchmarking various clustering methods on simulated mixed-type data sets with varying proportions of continuous and nominal variables. For this purpose, we test a newer clustering algorithm, KAMILA, against K-prototypes and against tandem analysis, in which data are preprocessed using multiple correspondence analysis and then clustered using K-means, fuzzy K-means, probabilistic distance clustering (PD), or Student-t mixture models.
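As an illustration of what clustering mixed-type data involves, the following minimal sketch shows the dissimilarity commonly associated with K-prototypes: squared Euclidean distance on the continuous variables plus a weight times the number of mismatches on the nominal variables. It is not taken from the paper; the function names, the weight name `gamma`, and its default value are assumptions made here for illustration only.

```python
from typing import List, Sequence, Tuple


def k_prototypes_dissimilarity(
    x_num: Sequence[float],
    x_cat: Sequence[str],
    proto_num: Sequence[float],
    proto_cat: Sequence[str],
    gamma: float = 1.0,  # assumed weight balancing nominal against continuous parts
) -> float:
    """Squared Euclidean distance on the continuous part plus
    gamma times the number of nominal mismatches."""
    numeric_part = sum((a - b) ** 2 for a, b in zip(x_num, proto_num))
    nominal_part = sum(1 for a, b in zip(x_cat, proto_cat) if a != b)
    return numeric_part + gamma * nominal_part


def assign_to_nearest(
    x_num: Sequence[float],
    x_cat: Sequence[str],
    prototypes: List[Tuple[Sequence[float], Sequence[str]]],
    gamma: float = 1.0,
) -> int:
    """Return the index of the prototype with the smallest dissimilarity."""
    distances = [
        k_prototypes_dissimilarity(x_num, x_cat, p_num, p_cat, gamma)
        for p_num, p_cat in prototypes
    ]
    return min(range(len(distances)), key=distances.__getitem__)


# Hypothetical prototypes: (continuous values, nominal values)
example_prototypes = [([0.0, 0.0], ["a", "x"]), ([5.0, 5.0], ["b", "y"])]
nearest = assign_to_nearest([1.0, 2.0], ["a", "y"], example_prototypes, gamma=0.5)
```

The choice of `gamma` controls how strongly the nominal variables influence the assignment, which is one reason the proportion of continuous and nominal variables matters when comparing methods.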

Keywords

K-prototypes, KAMILA, Mixed-type data clustering, Multiple correspondence analysis

Department

Mathematics and Statistics
