Publication Date

Fall 2025

Degree Type

Master's Project

Degree Name

Master of Science in Bioinformatics (MSBI)

Department

Computer Science

First Advisor

Dr. Wendy Lee

Second Advisor

Dr. William Andreopoulos

Third Advisor

Dr. Fabio Di Troia

Keywords

Oxford Nanopore Sequencing, Somatic Variant Calling, Sequencing Artifacts, Machine Learning

Abstract

Oxford Nanopore Technologies (ONT) sequencing is a popular long-read platform in genomics. However, its high base-calling error rate produces sequencing artifacts, and detecting somatic variants in ONT-sequenced tumor-normal samples remains challenging because true variants often occur at low allele frequencies. In this study, machine learning was applied to classify ClairS calls as true variants or artifacts, using a dataset created by benchmarking ClairS output against the HCC1395 and COLO829 truth sets. Features were engineered from sequence context and variant site characteristics to model artifact profiles. HistGradientBoostingClassifier achieved an accuracy of 0.877, outperforming all other models evaluated. Variant quality was the top predictor, with an aggregate accuracy of over 85%. These results suggest that integrating variant-level features with ensemble learning offers a robust strategy for improving Nanopore-based somatic mutation detection.
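
As an illustration of what sequence-context feature engineering can look like, the following is a minimal sketch rather than the project's actual pipeline. The abstract does not name the features used, so the function name `sequence_context_features` and the three features it computes (GC fraction, homopolymer run length, and trinucleotide context) are assumptions chosen for illustration. Homopolymer runs are a common choice here because they are a known source of errors in nanopore reads.

```python
def sequence_context_features(ref_window: str, center: int) -> dict:
    """Compute illustrative context features for the base at index `center`.

    `ref_window` is a stretch of reference sequence around a candidate
    variant site. The feature set is hypothetical, not the project's own.
    """
    window = ref_window.upper()
    base = window[center]

    # Fraction of G and C bases in the window.
    gc_fraction = sum(1 for b in window if b in "GC") / len(window)

    # Extend left and right from the site while the base repeats,
    # giving the length of the homopolymer run that contains the site.
    left = center
    while left > 0 and window[left - 1] == base:
        left -= 1
    right = center
    while right < len(window) - 1 and window[right + 1] == base:
        right += 1

    return {
        "gc_fraction": gc_fraction,
        "homopolymer_length": right - left + 1,
        # The site plus one flanking base on each side (shorter at the edges).
        "trinucleotide": window[max(center - 1, 0):center + 2],
    }


# Example: a candidate site in the middle of a run of five T bases.
features = sequence_context_features("ACGTTTTTGCA", 5)
print(features)
```

Features of this kind would typically be combined with variant-level fields from the caller's output, such as variant quality, before being passed to a classifier.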
