Publication Date

Spring 2025

Degree Type

Master's Project

Degree Name

Master of Science in Computer Science (MSCS)

Department

Computer Science

First Advisor

Fabio Di Troia

Second Advisor

Mark Stamp

Third Advisor

Faranak Abri

Keywords

Dimension reduction, malware visualization, natural language processing, machine learning

Abstract

Machine learning has become a popular and powerful tool for malware analysis and detection. With the rise of natural language processing (NLP) techniques, researchers can now extract contextual embeddings from malware opcode sequences, making it possible to analyze hidden malware patterns and advanced code obfuscation strategies. However, unlike malware binaries, which can be visualized directly as images, these embeddings exist in high-dimensional spaces, which makes their global patterns and spatial structures difficult to observe. In this paper, we propose a framework for visualizing malware embeddings in lower-dimensional space using various dimensionality reduction techniques. Our approach converts malware binaries into mnemonic opcode sequences, applies NLP models to generate embeddings, and projects these embeddings into lower-dimensional spaces for visualization. This enables us to evaluate how well different NLP techniques capture structural patterns across malware families. Experimental results show that Word2Vec outperforms BERT and GloVe in preserving both intra-family (local) and inter-family (global) structures in the reduced space. These findings are consistent with prior research highlighting Word2Vec’s effectiveness in generating meaningful malware representations. Our framework can serve as a visual evaluation metric that uses low-dimensional projections to assess the quality of malware embeddings, which aids in selecting the most suitable NLP technique for capturing the structural characteristics of malware.
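
The last two stages of the pipeline described above (embedding opcode sequences, then projecting the embeddings to a lower dimension) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and not taken from the project: the per-opcode vectors are invented toy values standing in for the output of a trained model such as Word2Vec, mean pooling is just one way to turn a sequence into a single embedding, PCA is only one of the possible dimensionality reduction techniques, and the family and function names are hypothetical.

```python
import math

# Toy per-opcode vectors. The numbers are invented for illustration and
# stand in for vectors that a trained NLP model would produce.
OPCODE_VECTORS = {
    "mov":  [1.0, 0.0, 0.0, 0.5],
    "push": [0.9, 0.1, 0.0, 0.4],
    "call": [0.0, 1.0, 0.2, 0.0],
    "xor":  [0.0, 0.1, 1.0, 0.3],
    "jmp":  [0.1, 0.9, 0.1, 0.0],
}


def embed_sequence(opcodes):
    """Mean-pool known opcode vectors into one embedding (assumed pooling choice)."""
    known = [OPCODE_VECTORS[op] for op in opcodes if op in OPCODE_VECTORS]
    if not known:
        raise ValueError("sequence contains no known opcodes")
    dim = len(known[0])
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]


def _top_eigenvector(matrix, iterations=500):
    """Power iteration: approximate the dominant eigenvector of a symmetric PSD matrix."""
    dim = len(matrix)
    v = [1.0 / (i + 1) for i in range(dim)]  # fixed start keeps runs deterministic
    norm = math.sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]
    for _ in range(iterations):
        w = [sum(a * b for a, b in zip(row, v)) for row in matrix]
        norm = math.sqrt(sum(x * x for x in w))
        if norm < 1e-12:  # matrix is numerically zero; keep the current vector
            break
        v = [x / norm for x in w]
    return v


def pca_project(rows, k=2):
    """Project rows onto their top-k principal components."""
    n, dim = len(rows), len(rows[0])
    if n < 2:
        raise ValueError("need at least two samples")
    means = [sum(r[j] for r in rows) / n for j in range(dim)]
    centered = [[r[j] - means[j] for j in range(dim)] for r in rows]
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(dim)]
           for i in range(dim)]
    components = []
    for _ in range(k):
        v = _top_eigenvector(cov)
        lam = sum(v[i] * sum(cov[i][j] * v[j] for j in range(dim))
                  for i in range(dim))
        components.append(v)
        # Deflate: remove the found component so the next pass finds the next one.
        cov = [[cov[i][j] - lam * v[i] * v[j] for j in range(dim)]
               for i in range(dim)]
    return [[sum(c[j] * comp[j] for j in range(dim)) for comp in components]
            for c in centered]


# Hypothetical samples from two made-up families.
samples = {
    "family_A_1": ["mov", "push", "mov", "push"],
    "family_A_2": ["push", "mov", "mov"],
    "family_B_1": ["call", "jmp", "call"],
    "family_B_2": ["jmp", "jmp", "call", "xor"],
}
embeddings = [embed_sequence(seq) for seq in samples.values()]
for name, point in zip(samples, pca_project(embeddings, k=2)):
    print(name, [round(x, 3) for x in point])
```

In a sketch like this, samples from the same family should land close together in the 2-D output while the two families separate along the first component, which is the kind of local and global structure the abstract refers to. A real evaluation would replace the toy vectors with learned embeddings and would likely rely on a library implementation of the chosen reduction technique.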

Available for download on Monday, May 25, 2026
