Author

Ranjit John

Publication Date

Spring 2024

Degree Type

Master's Project

Degree Name

Master of Science in Computer Science (MSCS)

Department

Computer Science

First Advisor

Fabio Di Troia

Second Advisor

William Andreopoulos

Third Advisor

Genya Ishigaki

Keywords

Class-imbalance, Undersampling, Oversampling, Hybrid-sampling, Generative Adversarial Networks, Multilayer Perceptron, K-Nearest Neighbors, Support Vector Machine, Random Forest

Abstract

Over the years, machine learning has produced many breakthroughs in detecting and classifying malware threats. However, training a holistic machine learning model that classifies malware effectively remains an open research topic. Datasets represent some malware types disproportionately, which can degrade the performance of machine learning classifiers. Without ample data, less common but highly dangerous malware can go undetected by classifiers, leading to devastating outcomes. Data balancing techniques have proven effective at representing minority classes better and reducing the bias towards the majority class. Recent research has also shown that generative modeling can create synthetic data that closely resembles the original data. This paper explores various balancing techniques and generates synthetic opcode sequence data to train machine learning models that classify malware more accurately. We employ oversampling, undersampling, hybrid-sampling, and Wasserstein Generative Adversarial Networks with Gradient Penalty (WGAN-GP) to generate fake data samples, and we compare their effectiveness in tackling the class imbalance problem in multi-class malware classification.

Available for download on Thursday, May 22, 2025
