A Comparative Study of Linear and Non-Linear Dimensionality Reduction for Opcode-Frequency Malware Classification
Publication Date
1-28-2026
Document Type
Article
Publication Title
Journal of Computer Virology and Hacking Techniques
Volume
22
Issue
1
DOI
10.1007/s11416-026-00597-1
Abstract
High-dimensional feature spaces in malware classification pose significant challenges for machine learning performance. To address these challenges, this paper presents a comparative evaluation of four dimensionality-reduction techniques applied to opcode-frequency representations of malware: Principal Component Analysis (PCA), Linear Discriminant Analysis (LDA), Uniform Manifold Approximation and Projection (UMAP), and autoencoder-based reduction. Using a corpus comprising 82,569 samples and 1,796 opcodes, we analyze the effect of each reduction method across multiple target dimensions and two classifier architectures: Extreme Gradient Boosting (XGBoost) and a three-layer Multilayer Perceptron (MLP). Results show that LDA achieves strong separability at lower dimensions, while PCA performs best at higher dimensions where variance preservation is critical. Autoencoder-based reduction provides consistently high accuracy with compact representations, whereas UMAP exhibits limited benefits for tabular opcode data. The findings highlight trade-offs between linear and non-linear reduction strategies and provide guidance for selecting efficient feature compression methods in large-scale malware analysis.
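The abstract does not say how the opcode-frequency representation is built. As a rough illustration only, the sketch below shows one common way to turn a sample's opcode sequence into a fixed-length relative-frequency vector, which is the kind of input a reduction method such as PCA or LDA would then compress. The function name `opcode_frequencies`, the five-opcode vocabulary, and the toy sample are hypothetical and are not taken from the paper, whose feature space covers 1,796 opcodes.

```python
from collections import Counter


def opcode_frequencies(opcodes, vocabulary):
    """Map one sample's opcode sequence to a relative-frequency vector.

    Assumption: opcodes outside the vocabulary are ignored, and a sample
    with no in-vocabulary opcodes yields an all-zero vector.
    """
    index = {op: i for i, op in enumerate(vocabulary)}
    counts = Counter(op for op in opcodes if op in index)
    total = sum(counts.values())
    vector = [0.0] * len(vocabulary)
    if total == 0:
        return vector
    for op, n in counts.items():
        vector[index[op]] = n / total
    return vector


# Hypothetical toy vocabulary and sample, for illustration only.
VOCABULARY = ["mov", "push", "pop", "call", "jmp"]
SAMPLE = ["mov", "push", "mov", "call", "mov", "pop", "nop", "push"]

features = opcode_frequencies(SAMPLE, VOCABULARY)
print(features)
```

Stacking one such vector per sample gives a samples-by-opcodes matrix. With a vocabulary in the thousands, most entries are likely to be zero or near zero for any given sample, which is one reason dimensionality reduction becomes attractive before classification.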
Keywords
Dimensionality reduction, Machine learning, Malware classification
Department
Computer Science
Recommended Citation
Chandler Lu and Fabio Di Troia. "A Comparative Study of Linear and Non-Linear Dimensionality Reduction for Opcode-Frequency Malware Classification" Journal of Computer Virology and Hacking Techniques (2026). https://doi.org/10.1007/s11416-026-00597-1