Publication Date

Fall 2019

Degree Type

Master's Project

Degree Name

Master of Science (MS)

Department

Computer Science

First Advisor

Leonard Wesley

Second Advisor

Philip Heller

Third Advisor

Skyler Payne

Keywords

Next Generation Sequencing, Mutations, Homopolymer Regions, Pancreatic Cancer, Evidential Reasoning, Support Vector Classifier

Abstract

This study examines how an evidential reasoning approach can serve as a diagnostic tool for early detection of pancreatic cancer. The evidential reasoning model combines the output of a linear Support Vector Classifier (SVC) with factors such as smoking history, health history, biopsy location, and the NGS technology used, among others, to predict the likelihood of the disease. The SVC was trained on genomic data from pancreatic cancer patients obtained from the National Cancer Institute (NCI) Genomic Data Commons (GDC). To test the evidential reasoning model, we compiled a variety of synthetic data covering combinations of different factors. Through experimentation, we monitored how the evidential interval for pancreatic cancer changed with the inputs provided. The evidential interval increased, supporting the machine learning prediction of pancreatic cancer, when the input changed from a non-smoker and non-drinker to an individual with a highly active smoking and drinking history. Similarly, with the machine learning prediction for pancreatic cancer held high, the evidential interval increased significantly when the sequencing read quality input changed from high guanine-cytosine (GC) content and many homopolymer regions to moderate GC content and few homopolymer regions. This indicates that the initial reads had a higher likelihood of sequencing error, which made the machine learning output less accurate. This experiment shows that an evidence-based approach has the potential to contribute as a diagnostic tool for screening high-risk groups. Future work should focus on improving the machine learning model by using a larger pancreatic cancer genomic database.
Next steps will involve programmatically analyzing real sequencing reads for irregular guanine-cytosine content and high homopolymer regions.
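
As an illustration of the kind of read-level analysis described as a next step, the following minimal Python sketch computes the GC fraction and the longest homopolymer run of a single read. It is not taken from the project: the function names and the thresholds (gc_low, gc_high, max_run) are hypothetical placeholders chosen for illustration, and real cutoffs would depend on the sequencing platform.

```python
from itertools import groupby


def gc_fraction(read: str) -> float:
    """Fraction of bases in the read that are G or C (0.0 for an empty read)."""
    read = read.upper()
    if not read:
        return 0.0
    return sum(base in {"G", "C"} for base in read) / len(read)


def longest_homopolymer(read: str) -> int:
    """Length of the longest run of a single repeated base (0 for an empty read)."""
    return max((len(list(run)) for _, run in groupby(read.upper())), default=0)


def flag_read(read: str, gc_low: float = 0.35, gc_high: float = 0.65,
              max_run: int = 6) -> dict:
    """Summarize a read; all thresholds are hypothetical, not from the study."""
    gc = gc_fraction(read)
    run = longest_homopolymer(read)
    return {
        "gc_fraction": gc,
        "longest_homopolymer": run,
        # GC content outside the assumed [gc_low, gc_high] band is "irregular".
        "irregular_gc": not (gc_low <= gc <= gc_high),
        # A run at or above the assumed max_run length is a "high" homopolymer.
        "high_homopolymer": run >= max_run,
    }
```

For example, flag_read("ACGTTTTTTA") reports a longest homopolymer run of 6. In a setup like the one the abstract describes, flags of this kind could be passed to the evidential reasoning model as the read quality input.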
