Author

Dan Le

Publication Date

Fall 2025

Degree Type

Master's Project

Degree Name

Master of Science in Computer Science (MSCS)

Department

Computer Science

First Advisor

Fabio Di Troia

Second Advisor

William Andreopoulos

Third Advisor

Sayma Akther

Keywords

API Calls, Malware Classification, Diffusion Models, WGAN-GP, Synthetic Data Generation, Natural Language Processing, Machine Learning

Abstract

Malware classification through Application Programming Interface (API) call analysis is essential for modern cybersecurity. However, traditional classification approaches face significant challenges due to limited and imbalanced datasets. This project therefore proposes a class-conditional diffusion model that is conditioned on malware class and generates realistic synthetic API-call embeddings for data augmentation. Seven embedding techniques are explored: Bag of Words (BoW), TF-IDF, Word2Vec (Skip-gram and CBOW), FastText, Doc2Vec, and DistilBERT. The two best-performing synthetic embeddings are then compared with the corresponding embeddings generated by a Wasserstein GAN with Gradient Penalty (WGAN-GP), another popular generative model. The synthetic embeddings are evaluated through downstream classification performance using Gaussian Naive Bayes, Random Forest, Support Vector Machine (SVM), and Multi-Layer Perceptron (MLP) classifiers at two levels: 7 malware families and 11 malware categories. The results show that mixing synthetic data with the original data improves classification accuracy by up to 8.1%. WGAN-GP outperformed diffusion for high-dimensional BoW embeddings, while diffusion showed advantages for low-dimensional TF-IDF embeddings. Optimal mixing ratios ranged from 90% original and 10% synthetic to 60% original and 40% synthetic, depending on the embedding type. BoW and TF-IDF embeddings showed the most consistent improvements. These findings indicate that generative model selection should be guided by embedding and dataset characteristics in data-limited scenarios.

Available for download on Saturday, December 19, 2026
