Publication Date
Fall 2025
Degree Type
Master's Project
Degree Name
Master of Science in Computer Science (MSCS)
Department
Computer Science
First Advisor
Fabio Di Troia
Second Advisor
William Andreopoulos
Third Advisor
Sayma Akther
Keywords
API Calls, Malware Classification, Diffusion Models, WGAN-GP, Synthetic Data Generation, Natural Language Processing, Machine Learning
Abstract
Malware classification through Application Programming Interface (API) call analysis is essential for modern cybersecurity. However, traditional classification approaches often struggle with limited and imbalanced datasets. This project therefore proposes a class-conditional diffusion model that is trained on labeled malware classes and generates realistic synthetic API call embeddings for data augmentation. Seven embedding techniques are explored: Bag of Words (BoW), TF-IDF, Word2Vec (Skip-gram and CBOW), FastText, Doc2Vec, and DistilBERT. The two best-performing synthetic embeddings are then compared with the corresponding embeddings generated by a Wasserstein GAN with Gradient Penalty (WGAN-GP), another popular generative model. The synthetic embeddings are evaluated through downstream classification performance using Gaussian Naive Bayes, Random Forest, Support Vector Machine (SVM), and Multi-Layer Perceptron (MLP) classifiers at two levels: 7 malware families and 11 malware categories. Final results demonstrate that mixing synthetic data with the original data improves classification accuracy by up to 8.1%. WGAN-GP outperformed diffusion for high-dimensional BoW embeddings, while diffusion showed advantages for low-dimensional TF-IDF embeddings. Optimal mixing ratios ranged from 90% original/10% synthetic to 60% original/40% synthetic, depending on embedding type. BoW and TF-IDF embeddings showed the most consistent improvements. These findings demonstrate that generative model selection should be guided by embedding and dataset characteristics in data-limited scenarios.
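As an illustration of the mixing ratios mentioned in the abstract, the following is a minimal sketch of how an augmented training set could be assembled. It is not the project's code. It assumes a ratio such as 90%/10% describes the composition of the final training set, that all original samples are kept, and that synthetic samples are drawn without replacement. The function name `mix_training_set`, its parameters, and the toy data are hypothetical.

```python
import random


def mix_training_set(original, synthetic, original_fraction, seed=0):
    """Return all original samples plus enough synthetic ones to reach the ratio.

    Assumption: `original_fraction` is the share of original samples in the
    final mixed set (0.9 means 90% original, 10% synthetic).
    """
    if not 0.0 < original_fraction <= 1.0:
        raise ValueError("original_fraction must be in (0, 1]")
    # Number of synthetic samples needed so originals make up the requested share.
    n_synthetic = round(len(original) * (1.0 - original_fraction) / original_fraction)
    if n_synthetic > len(synthetic):
        raise ValueError("not enough synthetic samples for this ratio")
    rng = random.Random(seed)
    chosen = rng.sample(synthetic, n_synthetic)  # draw without replacement
    mixed = list(original) + chosen
    rng.shuffle(mixed)  # avoid ordering originals before synthetic samples
    return mixed


# Toy (embedding, label) pairs; the second coordinate marks the source:
# 0.0 for original samples, 1.0 for synthetic ones.
original = [([float(i), 0.0], "family_a") for i in range(90)]
synthetic = [([float(i), 1.0], "family_a") for i in range(50)]

mixed = mix_training_set(original, synthetic, original_fraction=0.9)
n_from_generator = sum(1 for vec, _ in mixed if vec[1] == 1.0)
print(len(mixed), n_from_generator)
```

In a class-conditional setting the same function could be applied per class, so that each malware family or category is augmented at the chosen ratio; whether the project balanced classes this way is not stated in the abstract.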
Recommended Citation
Le, Dan, "Diffusion Model On API Call Classification" (2025). Master's Projects. 1600.
DOI: https://doi.org/10.31979/etd.y7v8-8n5t
https://scholarworks.sjsu.edu/etd_projects/1600