Publication Date

Spring 2022

Degree Type

Master's Project

Degree Name

Master of Science (MS)


Computer Science

First Advisor

Fabio Di Troia

Second Advisor

Mark Stamp

Third Advisor

Katerina Potika


Contextualized Embeddings, Transformer models, BERT, Bidirectional Language Models, ELMo, GloVe, Word2vec, fastText, Optuna


Malware classification, the task of distinguishing between different types of malware, is an integral part of system security. The aim of this project is to use context-dependent word embeddings to classify malware. The transformer is a novel architecture that uses self-attention to handle long-range dependencies. Transformers are particularly effective in many complex natural language processing tasks, such as Masked Language Modelling (MLM) and Next Sentence Prediction (NSP). Different transformer architectures, such as BERT, DistilBERT, ALBERT, and RoBERTa, are used to generate context-dependent word embeddings. These embeddings help in classifying different malware samples based on their similarity and context.

Apart from transformer models, we also experimented with bidirectional language models such as ELMo, which can generate contextualized opcode embeddings. This project also discusses algorithms for generating embeddings for byte-level N-grams. We use the Word2vec, GloVe, and fastText algorithms to generate context-free embeddings. The classification algorithms employed in this project are a ResNet-101 CNN, random forest, Support Vector Machines (SVM), and 𝑘-nearest neighbours. Transformer models sometimes act as black boxes, which makes their decisions difficult to understand. We use various interpretable models to explain the inner workings of these models and the results they produce.
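
As an illustration of the byte-level N-gram step mentioned in the abstract, the following is a minimal sketch of how a binary could be turned into a sequence of N-gram tokens before an embedding algorithm such as Word2vec is applied. The function name `byte_ngrams`, the hex-string token encoding, and the sample bytes are assumptions made for this example; the abstract does not state how the project itself tokenizes bytes.

```python
from collections import Counter


def byte_ngrams(data: bytes, n: int = 2) -> list[str]:
    """Slide a window of n bytes over data and return each window as a hex token.

    Hypothetical helper for illustration only, not the project's implementation.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    # A sequence shorter than n yields no windows, so the result is empty.
    return [data[i:i + n].hex() for i in range(len(data) - n + 1)]


# Made-up sample bytes; a real pipeline would read them from a malware binary.
sample = bytes([0x4D, 0x5A, 0x90, 0x00, 0x4D, 0x5A])

tokens = byte_ngrams(sample, n=2)
counts = Counter(tokens)

print(tokens)
print(counts.most_common(1))
```

Each token list can then be treated as a "sentence" for a context-free embedding algorithm, with one list per malware sample.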