Author

Ayush Koul

Publication Date

Spring 2025

Degree Type

Master's Project

Degree Name

Master of Science in Computer Science (MSCS)

Department

Computer Science

First Advisor

Fabio Di Troia

Second Advisor

Sayma Akther

Third Advisor

William Andreopoulos

Keywords

Malware, Clustering, Embedding, Word2Vec, FastText, Doc2Vec, DBSCAN, K-Means, Gaussian Mixture, Agglomerative, BIRCH

Abstract

Malware detection and classification remain critical challenges in cybersecurity, especially as malicious software becomes increasingly sophisticated and prevalent. While much of the work involving embeddings has traditionally relied on supervised learning approaches, there is significant potential in leveraging unsupervised learning techniques to discern hidden structures in malware data. By employing embedding techniques to convert malware samples into high-dimensional vector representations, we can capture the subtle and complex patterns inherent in malicious code without relying on pre-labeled data. This unsupervised approach helps categorize malware into predefined malware families, greatly aiding in developing cybersecurity solutions. In contrast to traditional supervised models that depend heavily on historical data and predefined labels, unsupervised learning facilitates the discovery of novel and previously unseen malware variants. This research investigates a range of embedding methods, including Word2Vec, FastText, and Doc2Vec, paired with various clustering techniques such as DBSCAN, K-Means, Gaussian Mixture, Agglomerative, and BIRCH. The objective is to comprehensively analyze the combined impact of these methods on malware detection in an unsupervised setting. By shifting the focus towards unsupervised learning, this paper highlights the potential to capture malware’s dynamic and evolving nature, ultimately contributing to more adaptive and resilient cybersecurity strategies.

Available for download on Monday, May 25, 2026
