Author

Li Miao

Publication Date

Fall 2015

Degree Type

Master's Project

Department

Computer Science

Abstract

With the overwhelming amount of data pouring into our lives, extracting meaningful information from it has become an essential task. How can people mine for "gold" in this data, and what tools can they use to do so? Clustering is one of the most widely used tools for this purpose. In this project, two clustering algorithms are studied and numerically compared on various data sets. The first is K-means clustering, which starts with roughly guessed initial clusters, assigns each data point to the nearest cluster, updates the clusters, and repeats until it converges. The second is Fast Search and Find of Density Peaks (FSDP), which can automatically detect the correct number of clusters from the inherent properties of its decision graph. It is based on the following assumptions: 1) cluster centers have a higher density than their neighboring data points; 2) the distance between a cluster center and any data point with a higher local density is relatively large. Its decision graph is a graphical and intuitive representation of the clustering. One may obtain more or fewer clusters by setting smaller or larger thresholds, respectively. The two algorithms are described in the following chapters and are implemented in Java. To compare how well they perform on four benchmark data sets, we use two metrics: entropy and purity. The results demonstrate that K-means clustering is faster, while FSDP is more accurate.
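The two FSDP assumptions can be made concrete with a small Java sketch that computes, for each point, a local density and the distance to the nearest point of higher density, which are the two quantities a decision graph would plot. This is only an illustration and not the project's implementation: the class name `DensityPeaksSketch`, the method names, the cutoff distance `dc`, the choice of counting neighbors within `dc` as the density, and the toy data are all assumptions made here.

```java
import java.util.Arrays;

public class DensityPeaksSketch {

    // Euclidean distance between two points of equal dimension.
    public static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int k = 0; k < a.length; k++) {
            double diff = a[k] - b[k];
            sum += diff * diff;
        }
        return Math.sqrt(sum);
    }

    // Local density of each point: the number of other points closer than
    // the cutoff dc (dc is an assumed, user-chosen parameter).
    public static int[] localDensity(double[][] points, double dc) {
        int n = points.length;
        int[] rho = new int[n];
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++) {
                if (i != j && distance(points[i], points[j]) < dc) {
                    rho[i]++;
                }
            }
        }
        return rho;
    }

    // For each point, the distance to the nearest point with strictly higher
    // density. A point with no denser point gets its largest distance to any
    // other point, so it stands out in the decision graph.
    public static double[] distanceToHigherDensity(double[][] points, int[] rho) {
        int n = points.length;
        double[] delta = new double[n];
        for (int i = 0; i < n; i++) {
            double nearestHigher = Double.POSITIVE_INFINITY;
            double farthest = 0.0;
            for (int j = 0; j < n; j++) {
                if (i == j) {
                    continue;
                }
                double d = distance(points[i], points[j]);
                farthest = Math.max(farthest, d);
                if (rho[j] > rho[i]) {
                    nearestHigher = Math.min(nearestHigher, d);
                }
            }
            delta[i] = Double.isInfinite(nearestHigher) ? farthest : nearestHigher;
        }
        return delta;
    }

    public static void main(String[] args) {
        // Toy data (hypothetical): one group near the origin, one near (10, 10).
        double[][] points = {
            {0, 0}, {0, 1}, {1, 0}, {1, 1}, {0.5, 0.5},
            {10, 10}, {10, 11}, {11, 10}
        };
        int[] rho = localDensity(points, 1.1);
        double[] delta = distanceToHigherDensity(points, rho);
        System.out.println("rho   = " + Arrays.toString(rho));
        System.out.println("delta = " + Arrays.toString(delta));
    }
}
```

On this toy data, the points (0.5, 0.5) and (10, 10) combine a comparatively high density with a large distance to any denser point, so they would be the natural candidates for cluster centers. Thresholds on the two quantities then decide how many such candidates are accepted.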
