Publication Date
Fall 2023
Degree Type
Master's Project
Degree Name
Master of Science in Computer Science (MSCS)
Department
Computer Science
First Advisor
Dr. Fabio Di Troia
Second Advisor
Dr. Mark Stamp
Third Advisor
Dr. William Andreopoulos
Keywords
Malware classification, Image-based methods, t-SNE images, EMBER dataset, Feature vectors, SqueezeNet, MobileNet, LightGBM, Normalization techniques, norm-1, norm-2, Memory overload, Training process, Malware detection, Comparative analysis.
Abstract
This Master’s project proposes a novel technique for classifying malware using image-based methods. The approach involves generating t-SNE images from the EMBER dataset, which contains one million samples of both malware and benign files, each represented by over 2,000 features. The t-SNE technique is well-suited for capturing intricate patterns in complex datasets because it preserves the local neighborhood structure of high-dimensional data. These t-SNE images are then used as inputs to train two lightweight image classification models, SqueezeNet and MobileNet. Additionally, to provide a benchmark for comparison, a non-image classification model using LightGBM is also explored.
As part of the investigation, the project compares two different normalization techniques, norm-1 and norm-2, applied to the feature vectors before converting them into t-SNE images. This comparison allows for a thorough understanding of how normalization affects the results.
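As a rough illustration of the two normalizations compared here, the sketch below scales a feature vector to unit length under each norm. It assumes that "norm-1" and "norm-2" refer to the standard L1 and L2 vector norms; the function names and the toy vector are hypothetical and are not taken from the project.

```python
import math

def l1_normalize(vec):
    """Scale vec so its absolute values sum to 1 (assumed meaning of norm-1)."""
    total = sum(abs(x) for x in vec)
    # An all-zero vector has no direction to preserve, so return it unchanged.
    return [x / total for x in vec] if total else list(vec)

def l2_normalize(vec):
    """Scale vec so its Euclidean length is 1 (assumed meaning of norm-2)."""
    length = math.sqrt(sum(x * x for x in vec))
    return [x / length for x in vec] if length else list(vec)

# Toy feature vector for illustration only; not EMBER data.
features = [3.0, -4.0, 0.0]
l1_result = l1_normalize(features)
l2_result = l2_normalize(features)
print(l1_result)
print(l2_result)
```

Both functions keep the direction of the vector and change only its scale, so any difference in downstream results would come from how each norm weights large feature values.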
Because training consumed excessive CPU memory, the project takes a practical approach: the training dataset is partitioned into three batches, and each model is trained on the batches consecutively. This keeps memory usage within limits and makes model training feasible.
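The three-batch strategy can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's training code: the `Perceptron` class is a toy stand-in for SqueezeNet or MobileNet, and `split_into_batches`, `train_sequentially`, and `load_batch` are hypothetical names. The point it shows is that only one partition is loaded at a time while the model's weights carry over between partitions.

```python
def split_into_batches(n_samples, n_batches=3):
    """Return contiguous (start, stop) index ranges that cover n_samples."""
    base, extra = divmod(n_samples, n_batches)
    ranges, start = [], 0
    for i in range(n_batches):
        stop = start + base + (1 if i < extra else 0)
        ranges.append((start, stop))
        start = stop
    return ranges

class Perceptron:
    """Toy stand-in for an image classifier; weights persist across batches."""
    def __init__(self, n_features):
        self.w = [0.0] * n_features
        self.b = 0.0

    def predict(self, x):
        score = sum(wi * xi for wi, xi in zip(self.w, x)) + self.b
        return 1 if score > 0 else 0

    def fit_batch(self, xs, ys, epochs=5):
        for _ in range(epochs):
            for x, y in zip(xs, ys):
                err = y - self.predict(x)
                if err:
                    self.w = [wi + err * xi for wi, xi in zip(self.w, x)]
                    self.b += err

def train_sequentially(model, load_batch, n_samples, n_batches=3):
    """Train on one partition at a time so only that slice is in memory."""
    for start, stop in split_into_batches(n_samples, n_batches):
        xs, ys = load_batch(start, stop)
        model.fit_batch(xs, ys)
    return model

# Toy, linearly separable data for illustration only.
all_x = [[-2.0], [2.0], [-1.0], [1.0], [-3.0], [3.0]]
all_y = [0, 1, 0, 1, 0, 1]
loaded = []  # records which slices were requested

def load_batch(start, stop):
    loaded.append((start, stop))
    return all_x[start:stop], all_y[start:stop]

model = train_sequentially(Perceptron(1), load_batch, len(all_x))
print(loaded)
print(model.predict([-1.5]), model.predict([1.5]))
```

In a real setting `load_batch` would read one partition from disk and release it before the next is loaded, which is what bounds peak memory.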
The outcomes demonstrate accuracy scores of 0.914 for SqueezeNet and 0.944 for MobileNet. While these results show the promise of image-oriented methods for detecting and classifying malicious software, the incumbent benchmark method, LightGBM, still performs better, with an AUC value of 0.996. Given the absence of significant computational advantages with image-based methods, the recommendation at this time is to continue using LightGBM as the preferred method for malware detection. However, this study provides valuable insights into the potential of image-based approaches and sets the stage for further exploration and refinement in future research.
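The results above are reported with two different metrics, accuracy for the image models and AUC for LightGBM. The sketch below shows how each is computed on a toy set of scores, so the distinction is concrete. The function names, the 0.5 threshold, and the data are illustrative assumptions and do not come from the project.

```python
def accuracy(labels, scores, threshold=0.5):
    """Fraction of samples whose thresholded score matches the label."""
    preds = [1 if s >= threshold else 0 for s in scores]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

def roc_auc(labels, scores):
    """Probability that a random positive outscores a random negative.

    Ties count as half a win. This pairwise form is O(P * N) and is meant
    for illustration, not for large datasets.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy scores: every positive ranks above every negative, but all scores
# exceed the 0.5 threshold, so both negatives are misclassified.
labels = [0, 0, 1, 1]
scores = [0.6, 0.7, 0.8, 0.9]
acc = accuracy(labels, scores)
auc = roc_auc(labels, scores)
print(acc, auc)
```

Accuracy depends on the chosen decision threshold, whereas AUC depends only on how the scores rank the samples, so the two values are not directly interchangeable.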
Recommended Citation
Stowbunenko, Vincent, "Image-Based Classification of Malware using t-SNE Images" (2023). Master's Projects. 1301.
https://scholarworks.sjsu.edu/etd_projects/1301