Publication Date

Spring 2022

Degree Type

Master's Project

Degree Name

Master of Science (MS)

Department

Computer Science

First Advisor

Fabio Di Troia

Second Advisor

Mark Stamp

Third Advisor

Katerina Potika

Keywords

Contextualized Embeddings, Transformer models, BERT, Bidirectional Language Models, ELMo, Glove, Word2vec, Fasttext, Optuna

Abstract

Malware classification, the task of assigning malware samples to their respective types, is an integral part of system security. The aim of this project is to use context-dependent word embeddings to classify malware. The Transformer is a novel architecture that uses self-attention to handle long-range dependencies. Transformers are particularly effective in many complex natural language processing tasks such as Masked Language Modelling (MLM) and Next Sentence Prediction (NSP). Different transformer architectures such as BERT, DistilBERT, ALBERT, and RoBERTa are used to generate context-dependent word embeddings. These embeddings help classify malware samples based on their similarity and context.

Apart from using transformer models, we also experimented with bidirectional language models such as ELMo, which can generate contextualized opcode embeddings. This project also discusses algorithms for generating embeddings for byte-level n-grams. We use the Word2vec, GloVe, and fastText algorithms to generate context-free embeddings. The classification algorithms employed in this project are a ResNet-101 CNN, random forest, Support Vector Machines (SVM), and k-nearest neighbours. Transformer models sometimes act as black boxes, which makes their decisions difficult to understand. Various interpretable models are used to explain their inner workings and improve our understanding of their results.
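
As an illustration of the byte-level n-grams mentioned above, the following is a minimal sketch of how overlapping n-grams might be extracted from a binary and turned into tokens for an embedding algorithm. It is not taken from the project itself: the function name `byte_ngrams`, the hex-string token format, and the sample bytes are all assumptions made for this example.

```python
from collections import Counter


def byte_ngrams(data: bytes, n: int = 2) -> list[str]:
    """Return the overlapping byte-level n-grams of `data` as hex strings.

    Hypothetical helper for illustration; the project's own tokenization
    may differ.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    # Slide a window of n bytes over the input, one byte at a time.
    # Inputs shorter than n yield an empty list.
    return [data[i:i + n].hex() for i in range(len(data) - n + 1)]


# Made-up sample bytes, used only to show the shape of the output.
sample = bytes([0x55, 0x8B, 0xEC, 0x55, 0x8B])
tokens = byte_ngrams(sample, n=2)

# Token frequencies; each token could serve as a "word" when training
# a context-free embedding model on a corpus of such sequences.
counts = Counter(tokens)
```

Under these assumptions, each file becomes a sequence of n-gram tokens, which is the sentence-like form that word-embedding algorithms expect as input.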

Recommended Citation

Pandya, Vinay, "Contextualized Vector Embeddings for Malware Detection" (2022). Master's Projects. 1083.
DOI: https://doi.org/10.31979/etd.rjra-9c8m
https://scholarworks.sjsu.edu/etd_projects/1083

Included in

Artificial Intelligence and Robotics Commons, Information Security Commons
