Publication Date

Fall 2023

Degree Type

Master's Project

Degree Name

Master of Science in Computer Science (MSCS)

Department

Computer Science

First Advisor

Dr. Fabio Di Troia

Second Advisor

Dr. Mark Stamp

Third Advisor

Dr. William Andreopoulos

Keywords

Malware classification, Image-based methods, t-SNE images, EMBER dataset, Feature vectors, SqueezeNet, MobileNet, LightGBM, Normalization techniques, norm-1, norm-2, Memory overload, Training process, Malware detection, Comparative analysis.

Abstract

This Master’s project proposes a novel technique for classifying malware using image-based methods. The approach involves generating t-SNE images from the EMBER dataset, which contains one million samples of both malware and benign files, each represented by over 2,000 features. The t-SNE technique is well-suited for capturing intricate patterns in complex datasets because it effectively maintains the local structure. These t-SNE images are then used as inputs to train two lightweight image classification models, SqueezeNet and MobileNet. Additionally, to provide a benchmark for comparison, a non-image classification model using LightGBM is also explored.

As part of the investigation, the project compares two different normalization techniques, norm-1 and norm-2, applied to the feature vectors before converting them into t-SNE images. This comparison allows for a thorough understanding of how normalization affects the results.
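The abstract does not define norm-1 and norm-2. A minimal sketch, assuming they refer to per-sample L1 and L2 normalization of each feature vector, might look like the following. The helper name `normalize_rows` and the toy array are hypothetical and are not taken from the project.

```python
import numpy as np

def normalize_rows(features, order):
    """Scale each row to unit norm (order=1 for L1, order=2 for L2).

    Assumption: "norm-1" and "norm-2" in the abstract mean per-sample
    L1 and L2 normalization; the project may define them differently.
    """
    features = np.asarray(features, dtype=np.float64)
    norms = np.linalg.norm(features, ord=order, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # leave all-zero rows unchanged
    return features / norms

# Toy feature matrix standing in for EMBER feature vectors.
sample = np.array([[3.0, 4.0], [0.0, 0.0]])
l1 = normalize_rows(sample, 1)  # rows sum to 1 in absolute value
l2 = normalize_rows(sample, 2)  # rows have unit Euclidean length
```

Under this assumption, L1 normalization makes the absolute values of each row sum to one, while L2 normalization gives each row unit Euclidean length, so the two can produce different neighborhoods in the resulting t-SNE embedding.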

Because training consumed excessive CPU memory, the project takes a practical approach: the training dataset is partitioned into three batches, and the model is trained on each batch in sequence. This keeps memory usage within limits and makes model training feasible.
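The idea of training on one partition at a time can be sketched as follows. This is only an illustration under stated assumptions: a small logistic-regression model and synthetic data stand in for the image classifiers and t-SNE images, and the names `split_into_batches` and `train_on_batch` are hypothetical.

```python
import numpy as np

def split_into_batches(n_samples, n_batches=3):
    """Return index arrays for n_batches roughly equal partitions."""
    return np.array_split(np.arange(n_samples), n_batches)

def train_on_batch(weights, X, y, lr=0.1, epochs=50):
    """Continue gradient-descent training of a logistic model on one partition."""
    for _ in range(epochs):
        preds = 1.0 / (1.0 + np.exp(-(X @ weights)))
        weights = weights - lr * X.T @ (preds - y) / len(y)
    return weights

# Synthetic stand-in data (assumption: not the EMBER dataset).
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = (X[:, 0] > 0).astype(float)

weights = np.zeros(X.shape[1])
batches = split_into_batches(len(X))
for idx in batches:
    # In a memory-constrained setting, only this partition would be
    # loaded from disk before training; the weights carry over.
    weights = train_on_batch(weights, X[idx], y[idx])
```

The point of the sketch is that model state persists across partitions while only one partition needs to be resident in memory at a time.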

The models achieve accuracy scores of 0.914 for SqueezeNet and 0.944 for MobileNet. While these results show the promise of image-oriented methods for identifying and categorizing malicious software, the incumbent benchmark method, LightGBM, still performs better, with an AUC value of 0.996. Given the absence of significant computational advantages with image-based methods, the recommendation at this time is to continue using LightGBM as the preferred method for malware detection. However, this study provides valuable insights into the potential of image-based approaches and sets the stage for further exploration and refinement in future research.

Available for download on Friday, November 01, 2024
