Publication Date

Fall 2020

Degree Type

Thesis - Campus Access Only

Degree Name

Master of Science (MS)


Department

Physics and Astronomy


Advisor

Aaron Romanowsky


Keywords

Cosmological Simulations, Dark Matter, Machine Learning, Simulated Galaxies

Subject Areas

Astrophysics; Astronomy; Physics


Abstract

The application of machine learning (ML) techniques to simulated cosmological data aids in the development of predictive theories of galaxy formation, evolution, and the nature of dark matter (DM) in the Universe. We present the results of a simple binary classification model for predicting the dark matter fraction (DMF) of simulated galaxies using ML techniques such as principal component analysis and random forest (RF) classifier algorithms. The data were drawn from The Next Generation Illustris (IllustrisTNG) simulations, a series of gravo-magneto-hydrodynamical simulations of a mock Universe. The dataset was class-imbalanced and consisted of 2446 high-mass satellite galaxies (i.e., stellar masses ≥ 10⁹ M☉) from the twenty-two most massive simulated galaxy clusters (i.e., total cluster masses > 10¹⁴ M☉) in IllustrisTNG. The RF classifier was trained on simulated galaxy properties (e.g., masses, metallicities, color) and predicts DMF class labels, classifying galaxies as either DM-rich or DM-poor (based on a DMF threshold value of 0.8). The RF classifier had an overall accuracy of 92.15% and a ROC-AUC score of ∼90%. The RF predictions for the DM-rich majority class had a precision, recall, and F1 score of 93%, 97%, and 95%, respectively. The DM-poor minority class, on the other hand, had a precision, recall, and F1 score of 91%, 83%, and 87%, respectively. Thus, the results show that ML classifiers can be employed as novel analytical tools to “measure” hidden galaxy properties, such as the DMF, from simple observable properties with satisfactory results. Furthermore, employing more complex ML algorithms and data sources (e.g., observational data, EAGLE simulations, additional galaxy properties) could help improve the predictive power of the RF model and help gain insights into the DM stripping pathways in galaxy cluster environments.
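As a reading aid, the per-class scores quoted above can be cross-checked, since the F1 score is the harmonic mean of precision and recall. The short Python sketch below is not taken from the thesis: the function names `dmf_label` and `f1_score` are hypothetical, and treating a DMF exactly equal to 0.8 as DM-rich is an assumption, because the abstract gives only the threshold value.

```python
# Illustrative sketch only; not the thesis code.

DMF_THRESHOLD = 0.8  # threshold value quoted in the abstract


def dmf_label(dmf: float) -> str:
    """Map a dark matter fraction to a binary class label.

    Assumption: a galaxy sitting exactly at the threshold counts as
    DM-rich. The abstract does not say which side the boundary falls on.
    """
    return "DM-rich" if dmf >= DMF_THRESHOLD else "DM-poor"


def f1_score(precision: float, recall: float) -> float:
    """Return the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Per-class precision, recall, and F1 as reported in the abstract.
reported = {
    "DM-rich": {"precision": 0.93, "recall": 0.97, "f1": 0.95},
    "DM-poor": {"precision": 0.91, "recall": 0.83, "f1": 0.87},
}

# Recompute F1 from the reported precision and recall, rounded to the
# same two decimal places used in the abstract.
f1_check = {
    label: round(f1_score(m["precision"], m["recall"]), 2)
    for label, m in reported.items()
}

for label, m in reported.items():
    print(f"{label}: reported F1 = {m['f1']:.2f}, recomputed F1 = {f1_check[label]:.2f}")
```

The recomputed values agree with the reported ones. The lower recall of the DM-poor class (83%) is the figure that overall accuracy tends to hide on a class-imbalanced dataset, which is why the per-class scores are reported separately.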