Publication Date
Fall 2025
Degree Type
Master's Project
Degree Name
Master of Science in Bioinformatics (MSBI)
Department
Computer Science
First Advisor
Dr. Wendy Lee
Second Advisor
Dr. William Andreopoulos
Third Advisor
Dr. Fabio Di Troia
Keywords
Oxford Nanopore Sequencing, Somatic Variant Calling, Sequencing Artifacts, Machine Learning
Abstract
Oxford Nanopore Technologies (ONT) sequencing is a popular long-read platform in genomics. However, its high base-calling error rate produces sequencing artifacts, and detecting somatic variants in ONT-sequenced tumor-normal samples remains challenging because these variants often occur at low frequencies. In this study, machine learning was used to distinguish true variants from artifacts in a dataset created by benchmarking ClairS output against the HCC1395 and COLO829 truth sets. Features were engineered from sequence context and variant site characteristics to model artifact profiles. HistGradientBoostingClassifier achieved an accuracy of 0.877, outperforming all other models evaluated. Variant quality was the top predictor, with an aggregate accuracy of over 85%. These results indicate that integrating variant-level features with ensemble learning offers a robust strategy for improving Nanopore-based somatic mutation detection.
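The labeling and feature-engineering steps described in the abstract might look roughly like the following minimal sketch. It is not the project's actual pipeline: the `Call` record, the function names, and the two sequence-context features (GC fraction and longest homopolymer run) are hypothetical choices made for illustration, and the toy calls and truth set are invented placeholders rather than HCC1395 or COLO829 data.

```python
from typing import NamedTuple


class Call(NamedTuple):
    """Hypothetical record for one candidate somatic call."""
    chrom: str
    pos: int      # 1-based position
    ref: str
    alt: str
    qual: float   # variant quality reported by the caller


def label_calls(calls, truth):
    """Label each call 1 if it appears in the truth set, else 0 (artifact)."""
    return [1 if (c.chrom, c.pos, c.ref, c.alt) in truth else 0 for c in calls]


def gc_fraction(seq):
    """Fraction of G/C bases in a sequence window (0.0 for an empty window)."""
    seq = seq.upper()
    return sum(base in "GC" for base in seq) / len(seq) if seq else 0.0


def longest_homopolymer(seq):
    """Length of the longest run of a single repeated base."""
    best = run = 0
    prev = None
    for base in seq.upper():
        run = run + 1 if base == prev else 1
        best = max(best, run)
        prev = base
    return best


def context_features(window):
    """Assumed sequence-context features for the window around a call."""
    return {
        "gc_fraction": gc_fraction(window),
        "longest_homopolymer": longest_homopolymer(window),
    }


# Toy example: two calls, one of which is present in the truth set.
calls = [Call("chr1", 100, "A", "T", 25.0), Call("chr1", 200, "G", "C", 8.0)]
truth = {("chr1", 100, "A", "T")}
labels = label_calls(calls, truth)
features = context_features("ACGTTTTTGC")
```

Rows built this way (the features plus the caller's variant quality, with the truth-set label as the target) could then be passed to a classifier such as scikit-learn's HistGradientBoostingClassifier.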
Recommended Citation
Trikannad, Shwethal Sayeeram, "Characterizing Somatic Variants in Nanopore Data with Machine Learning" (2025). Master's Projects. 1621.
https://scholarworks.sjsu.edu/etd_projects/1621