Fast, Memory-Efficient Spectral Clustering with Cosine Similarity

Publication Date

1-1-2024

Document Type

Conference Proceeding

Publication Title

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

Volume

14469 LNCS

DOI

10.1007/978-3-031-49018-7_50

First Page

700

Last Page

714

Abstract

Spectral clustering is a popular and effective method, but it is known to face two significant challenges: scalability and out-of-sample extension. In this paper, we extend the work of Chen (ICPR 2018) on the speed scalability of spectral clustering in the setting of cosine similarity to deal with massive or online data that are too large to be fully loaded into computer memory. We start with a small batch of data drawn from the full set and develop an efficient procedure that learns both the nonlinear embedding and the clustering map from the sample and extends them easily to the rest of the data as they are gradually loaded. We then introduce an automatic approach to selecting the optimal sample size. Combining the two steps yields a streamlined, memory-efficient algorithm that uses only a small number of batches of data (as they become available), with memory and computational costs that are independent of the size of the data. Experiments are conducted on benchmark data sets to demonstrate the fast speed and excellent accuracy of the proposed algorithm. We conclude the paper by pointing out several future research directions.
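
For intuition only, the Python sketch below illustrates the general technique the abstract alludes to: under cosine similarity, the spectral embedding can be obtained from the SVD of the row-normalized data matrix, without ever forming the n-by-n similarity matrix, and new batches can be embedded and assigned by a Nystrom-style out-of-sample projection using quantities learned from the initial sample. The function names (fit_sample, extend_batch) and the simplifications noted in the comments are assumptions for illustration, not the paper's exact algorithm; in particular, the paper's automatic sample-size selection is not shown.

    import numpy as np
    from scipy.sparse.linalg import svds
    from sklearn.cluster import KMeans

    def fit_sample(X_sample, k):
        # Row-normalize so that inner products equal cosine similarities.
        Xn = X_sample / np.linalg.norm(X_sample, axis=1, keepdims=True)
        # Column sums give node degrees of the implicit similarity graph
        # W = Xn @ Xn.T without forming the n-by-n matrix W itself.
        colsum = Xn.sum(axis=0)
        d = Xn @ colsum  # degrees (self-similarity kept for simplicity;
                         # assumes nonnegative data so all degrees are > 0)
        Y = Xn / np.sqrt(d)[:, None]
        # Top-k singular triplets of the degree-normalized data matrix;
        # the left singular vectors span the spectral embedding.
        U, s, Vt = svds(Y, k=k)
        E = U / np.linalg.norm(U, axis=1, keepdims=True)
        km = KMeans(n_clusters=k, n_init=10).fit(E)
        return km, Vt, s, colsum

    def extend_batch(X_batch, km, Vt, s, colsum):
        # Embed a newly loaded batch using quantities learned from the sample.
        Xn = X_batch / np.linalg.norm(X_batch, axis=1, keepdims=True)
        d = Xn @ colsum                   # degrees approximated w.r.t. the sample
        Y = Xn / np.sqrt(d)[:, None]
        U_new = (Y @ Vt.T) / s            # Nystrom-style out-of-sample projection
        E = U_new / np.linalg.norm(U_new, axis=1, keepdims=True)
        return km.predict(E)

On a stream, fit_sample would run once on the initial batch and extend_batch on each subsequent batch, so memory use stays proportional to the batch size rather than the full data size, consistent with the abstract's cost claims.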

Keywords

Cosine similarity, Memory scalability, Spectral clustering, Speed scalability

Department

Mathematics and Statistics
