Publication Date
Spring 2024
Degree Type
Master's Project
Degree Name
Master of Science in Computer Science (MSCS)
Department
Computer Science
First Advisor
Fabio Di Troia
Second Advisor
William Andreopoulos
Third Advisor
Genya Ishigaki
Keywords
Class-imbalance, Undersampling, Oversampling, Hybrid-sampling, Generative Adversarial Networks, Multilayer Perceptron, K-Nearest Neighbors, Support Vector Machine, Random Forest
Abstract
Over the years, there have been many breakthroughs in applying machine learning to detect and classify malware threats. However, training a holistic machine learning model that classifies malware effectively remains an ongoing topic of research. Datasets represent some malware types disproportionately, which can degrade the performance of machine learning classifiers. Without ample data, less common but highly dangerous malware can go undetected by classifiers, leading to devastating outcomes. Data balancing techniques have proven effective at representing minority classes better and lessening the bias towards the majority class. Recent research has also shown that generative modeling can create synthetic data that closely resembles the original data. This paper explores various balancing techniques and generates synthetic opcode sequence data to train machine learning models that classify malware more accurately. We employ oversampling, undersampling, hybrid sampling, and Wasserstein Generative Adversarial Networks with Gradient Penalty (WGAN-GP) to balance the data and generate synthetic samples, and we compare their effectiveness in tackling the class imbalance problem in multi-class malware classification.
Recommended Citation
John, Ranjit, "Comparing Balancing Techniques for Malware Classification" (2024). Master's Projects. 1353.
DOI: https://doi.org/10.31979/etd.a56z-td5f
https://scholarworks.sjsu.edu/etd_projects/1353