Applied Soft Computing
Cluster analysis is a widely used unsupervised data analysis technique for finding groups of homogeneous units in a data set. Probabilistic distance clustering adjusted for cluster size (PDQ), discussed in this contribution, belongs to the broad category of clustering methods initially developed for continuous data; its advantages are fuzzy membership and robustness. However, a common challenge in clustering is handling mixed-type data that combine continuous and categorical variables, which are among the most common types of data. This paper extends PDQ to mixed-type data by using different dissimilarities for different kinds of variables. First, PDQ for mixed-type data is defined; then a simulation study shows its advantages over some state-of-the-art techniques; finally, the method is applied to a real data set. The conclusion outlines some future developments.
San José State University
Fuzzy clustering, Mixed-type data, Probabilistic distance clustering
Creative Commons License
This work is licensed under a Creative Commons Attribution-Noncommercial-No Derivative Works 4.0 License.
Mathematics and Statistics
Cristina Tortora and Francesco Palumbo. "Clustering mixed-type data using a probabilistic distance algorithm." Applied Soft Computing (2022). https://doi.org/10.1016/j.asoc.2022.109704