Publication Date

1-28-2026

Document Type

Article

Publication Title

Journal of Computer Virology and Hacking Techniques

Volume

22

Issue

1

DOI

10.1007/s11416-026-00597-1

Abstract

High-dimensional feature spaces in malware classification pose significant challenges for machine learning performance. To address these challenges, this paper presents a comparative evaluation of four dimensionality-reduction techniques applied to opcode-frequency representations of malware: Principal Component Analysis (PCA), Linear Discriminant Analysis (LDA), Uniform Manifold Approximation and Projection (UMAP), and autoencoder-based reduction. Using a corpus comprising 82,569 samples and 1,796 opcodes, we analyze the effect of each reduction method across multiple target dimensions and two classifier architectures: Extreme Gradient Boosting (XGBoost) and a three-layer Multilayer Perceptron (MLP). Results show that LDA achieves strong separability at lower dimensions, while PCA performs best at higher dimensions where variance preservation is critical. Autoencoder-based reduction provides consistently high accuracy with compact representations, whereas UMAP exhibits limited benefits for tabular opcode data. The findings highlight trade-offs between linear and non-linear reduction strategies and provide guidance for selecting efficient feature compression methods in large-scale malware analysis.

Keywords

Dimensionality reduction, Machine learning, Malware classification

Comments

This version of the article has been accepted for publication, after peer review (when applicable) and is subject to Springer Nature’s AM terms of use, but is not the Version of Record and does not reflect post-acceptance improvements, or any corrections. The Version of Record is available online at: https://doi.org/10.1007/s11416-026-00597-1

Department

Computer Science

Available for download on Wednesday, January 27, 2027
