ActivePCA: A Novel Framework Integrating PCA and Active Machine Learning for Efficient Dimension Reduction

Publication Date

1-1-2024

Document Type

Conference Proceeding

Publication Title

Proceedings - 2024 IEEE 48th Annual Computers, Software, and Applications Conference, COMPSAC 2024

DOI

10.1109/COMPSAC61105.2024.00052

First Page

320

Last Page

325

Abstract

In medical data analysis, high-dimensional datasets raise problems of computational complexity, resource utilization, and model interpretability. Principal Component Analysis (PCA), a prevalent dimension reduction technique, addresses these problems by transforming high-dimensional data into a lower-dimensional representation while preserving maximum variance. However, PCA has limitations in high-dimensional contexts: it can lose information, and because it uses the entire dataset in the transformation, its computational demands grow with sizable datasets. In this paper, we propose ActivePCA, a novel framework that integrates PCA and Active Machine Learning (AML) so that only a subset of each dataset is used in the dimension reduction process. In the first step, the framework selects the most informative instances from the dataset. In the second step, ActivePCA applies PCA to the selected subset only. To demonstrate its effectiveness, we applied the framework to six different Electronic Health Record (EHR) datasets with varying dimensions. The framework significantly reduces both the number of observations and the number of dimensions, using AML and PCA respectively, resulting in improved performance from ML classifiers. ActivePCA reduces labeling cost on the EHR datasets by approximately 50% to 80% compared to the original dimensions of the datasets. In addition, ActivePCA achieves significantly higher accuracy using the reduced dimensions, showing the effectiveness of AML while applying PCA.
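The two-step idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not name a query strategy, so the sketch assumes pool-based uncertainty sampling with a simple nearest-centroid model. The function names (`select_informative`, `fit_pca`, `transform`), the parameter values, and the synthetic data are all hypothetical.

```python
import numpy as np


def select_informative(X, y, n_seed=10, n_queries=40, seed=0):
    """Step 1 (assumed strategy): pool-based uncertainty sampling.

    Starts from a random labeled seed set, then repeatedly queries the pool
    point whose two nearest class centroids are closest to equidistant.
    """
    rng = np.random.default_rng(seed)
    labeled = [int(i) for i in rng.choice(len(X), size=n_seed, replace=False)]
    seen = set(labeled)
    pool = [i for i in range(len(X)) if i not in seen]
    for _ in range(n_queries):
        Xl, yl = X[labeled], y[labeled]
        classes = np.unique(yl)
        if len(classes) < 2:
            # Only one class labeled so far: fall back to a random query.
            pick = int(rng.integers(len(pool)))
        else:
            centroids = np.stack([Xl[yl == c].mean(axis=0) for c in classes])
            # Distance from every pool point to every class centroid.
            d = np.linalg.norm(X[pool][:, None, :] - centroids[None, :, :], axis=2)
            d.sort(axis=1)
            margin = d[:, 1] - d[:, 0]  # small margin = uncertain point
            pick = int(np.argmin(margin))
        labeled.append(pool.pop(pick))
    return np.array(labeled)


def fit_pca(X, n_components):
    """Step 2: fit PCA (via SVD of the centered data) on the given rows only."""
    mean = X.mean(axis=0)
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:n_components]


def transform(X, mean, components):
    """Project rows of X onto the fitted principal components."""
    return (X - mean) @ components.T


# Synthetic two-class data standing in for a real dataset (illustration only).
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0.0, 1.0, (100, 20)), rng.normal(1.5, 1.0, (100, 20))])
y = np.repeat([0, 1], 100)

idx = select_informative(X, y)                 # labels requested for 50 of 200 rows
mean, comps = fit_pca(X[idx], n_components=5)  # PCA fitted on the subset only
Z = transform(X, mean, comps)                  # all rows projected to 5 dimensions
```

In this toy setup, labels are requested for only the selected rows, and the projection learned from that subset is then applied to the full dataset. Any classifier trained on `Z[idx]` would see both fewer observations and fewer dimensions, which is the combination the abstract describes.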

Keywords

Active Machine Learning, Dimension Reduction, Electronic Health Records Datasets, PCA, Reduce Labeling Cost

Department

Applied Data Science
