Publication Date
Spring 2025
Degree Type
Master's Project
Degree Name
Master of Science in Computer Science (MSCS)
Department
Computer Science
First Advisor
Fabio Di Troia
Second Advisor
Sayma Akther
Third Advisor
William Andreopoulos
Keywords
Malware, Clustering, Embedding, Word2Vec, FastText, Doc2Vec, DBSCAN, K-Means, Gaussian Mixture, Agglomerative, BIRCH
Abstract
Malware detection and classification remain critical challenges in cybersecurity, especially as malicious software becomes increasingly sophisticated and prevalent. While much of the work involving embeddings has traditionally relied on supervised learning approaches, there is significant potential in leveraging unsupervised learning techniques to discern hidden structures in malware data. By employing embedding techniques to convert malware samples into high-dimensional vector representations, we can capture the subtle and complex patterns inherent in malicious code without relying on pre-labeled data. This unsupervised approach helps categorize malware into predefined malware families, greatly aiding in developing cybersecurity solutions. In contrast to traditional supervised models that depend heavily on historical data and predefined labels, unsupervised learning facilitates the discovery of novel and previously unseen malware variants. This research investigates a range of embedding methods, including Word2Vec, FastText, and Doc2Vec, paired with various clustering techniques such as DBSCAN, K-Means, Gaussian Mixture, Agglomerative, and BIRCH. The objective is to comprehensively analyze the combined impact of these methods on malware detection in an unsupervised setting. By shifting the focus towards unsupervised learning, this paper highlights the potential to capture malware’s dynamic and evolving nature, ultimately contributing to more adaptive and resilient cybersecurity strategies.
Recommended Citation
Koul, Ayush, "Comparative Analysis of Embedding Techniques with Clustering Algorithms for Malware Opcodes" (2025). Master's Projects. 1553.
DOI: https://doi.org/10.31979/etd.8duf-qgw4
https://scholarworks.sjsu.edu/etd_projects/1553