Faculty Research, Scholarly, and Creative Activity

Context-Aware Natural Language Processing for Malware Detection

Helen Liu, UW College of Engineering
Summer Mccune, University of Kentucky
Quang Duy Tran, San Jose State University
Fabio Di Troia, San Jose State University
Younghee Park, San Jose State University

Publication Date

8-27-2025

Document Type

Conference Proceeding

Publication Title

2025 Silicon Valley Cybersecurity Conference (SVCC 2025)

DOI

10.1109/SVCC65277.2025.11133638

Abstract

As malware continues to evolve and cyber attacks become increasingly prevalent, it is critical to develop effective malware classification techniques for detecting and preventing such attacks. In our research, we approach malware classification from a natural language processing perspective. We explore how various tokenization techniques applied to malware opcode features can enhance classification accuracy and counter malware obfuscation and evolution techniques under a deep learning approach, a question that existing research has not addressed. We bridge this gap by conducting extensive hyperparameter tuning experiments that examine the effects of five common tokenization methods (White Space Separation, Top Single Words, Byte-Pair Encoding, WordPiece, and Unigram) and a novel method that we introduce, called Top Word Pairs. We then transform the tokenized opcode sequences into feature vectors with four embedding methods: Word2Vec, HMM2Vec, BERT, and ELMo. Finally, we classify malware using three classification techniques: Support Vector Machines, Random Forest, and Multi-Layer Perceptron. Our results show that choosing the tokenization technique to suit the embedding and classification models in use can improve malware classification accuracy by up to 2% on our dataset.
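
To make the tokenization step in the abstract concrete, the following is a minimal sketch of three simple ideas on opcode sequences: whitespace separation, a frequency-capped vocabulary, and a single Byte-Pair-Encoding-style merge applied at the opcode level. It is not the authors' implementation. The sample opcode strings, all function names, and the `<unk>` placeholder are hypothetical. Reading "Top Single Words" as "keep the k most frequent opcodes" is an assumption, and the pair merge shown is a generic illustration rather than the paper's Top Word Pairs method, whose definition is not given in this record.

```python
from collections import Counter

def whitespace_tokenize(opcode_text):
    """Split a raw opcode string on whitespace."""
    return opcode_text.split()

def top_k_vocabulary(sequences, k):
    """Return the k most frequent opcodes across all sequences (assumed reading)."""
    counts = Counter(tok for seq in sequences for tok in seq)
    return {tok for tok, _ in counts.most_common(k)}

def restrict_to_vocabulary(seq, vocab, unk="<unk>"):
    """Replace every opcode outside the vocabulary with a placeholder token."""
    return [tok if tok in vocab else unk for tok in seq]

def most_frequent_pair(sequences):
    """Find the most frequent adjacent opcode pair, as in one BPE merge step."""
    pairs = Counter()
    for seq in sequences:
        pairs.update(zip(seq, seq[1:]))
    return pairs.most_common(1)[0][0] if pairs else None

def merge_pair(seq, pair, joiner="_"):
    """Merge each non-overlapping occurrence of `pair` into a single token."""
    merged, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            merged.append(seq[i] + joiner + seq[i + 1])
            i += 2
        else:
            merged.append(seq[i])
            i += 1
    return merged

# Hypothetical opcode sequences, made up for illustration only.
samples = ["mov push call mov push call ret", "mov push add mov push ret"]
tokenized = [whitespace_tokenize(s) for s in samples]

vocab = top_k_vocabulary(tokenized, k=2)
capped = [restrict_to_vocabulary(seq, vocab) for seq in tokenized]

pair = most_frequent_pair(tokenized)
merged = [merge_pair(seq, pair) for seq in tokenized]
```

In a full pipeline, token sequences such as `capped` or `merged` would then be passed to an embedding model and a classifier; those stages are omitted here because they depend on library and hyperparameter choices that this record does not specify.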

Funding Number

2244597

Funding Sponsor

National Science Foundation

Keywords

Computer Security, Embedding Vectors, Machine Learning, Malware Classification, Malware Detection, Natural Language Processing, Tokenization

Department

Computer Science

Recommended Citation

Helen Liu, Summer Mccune, Quang Duy Tran, Fabio Di Troia, and Younghee Park. "Context-Aware Natural Language Processing for Malware Detection." 2025 Silicon Valley Cybersecurity Conference (SVCC 2025) (2025). https://doi.org/10.1109/SVCC65277.2025.11133638
