Publication Date

Spring 2024

Degree Type

Master's Project

Degree Name

Master of Science in Bioinformatics (MSBI)

Department

Computer Science

First Advisor

Dr. Wendy Lee

Second Advisor

Dr. William Andreopoulos

Third Advisor

Dr. Fabio Di Troia

Keywords

Nanopore Sequencing, Deep Learning, Sequencing Artifacts, Sequence Context, Cancer Diagnosis

Abstract

Oxford Nanopore sequencing is a revolutionary new technology for sequencing DNA molecules in long stretches. However, it has a significantly higher error rate than conventional short-read sequencing, which produces numerous sequencing artifacts. These artifacts can be indistinguishable from low-frequency somatic variants, which is a roadblock for cancer diagnosis using liquid biopsies. In this study, benchmarked human genome samples from Genome in a Bottle were used to create a dataset of labeled variants comprising both artifacts and true variants. Variant features, including sequence context, were used to train various deep learning models. The multi-input neural network combining sequence context features with the other variant features achieved higher validation accuracy (0.871) than a model using the non-sequence-context features alone (0.853), demonstrating that the sequence context surrounding a variant has some predictive power regarding whether a called variant is a sequencing artifact or a true variant.
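
As a rough illustration of what the two inputs to a multi-input model might look like, the sketch below builds a sequence context window around a variant position and keeps the remaining variant features as a separate vector. The abstract does not state how sequence context was encoded, so the one-hot encoding, the flank size, the N padding, and every function name and feature value here are assumptions made for illustration, not the project's actual pipeline.

```python
from typing import List, Sequence, Tuple

BASES = "ACGT"  # assumed alphabet; any other symbol (e.g. N) encodes as all zeros


def extract_context(reference: str, pos: int, flank: int = 5) -> str:
    """Return the bases within `flank` positions of 0-based `pos`.

    Windows that run past either end of the reference are padded with N
    so that every context has length 2 * flank + 1.
    """
    left = max(0, pos - flank)
    right = min(len(reference), pos + flank + 1)
    pad_left = "N" * (flank - (pos - left))
    pad_right = "N" * (flank - (right - pos - 1))
    return pad_left + reference[left:right] + pad_right


def one_hot(context: str) -> List[List[int]]:
    """One-hot encode a context string as one 4-element row per base."""
    return [[1 if base == b else 0 for b in BASES] for base in context.upper()]


def build_inputs(
    reference: str,
    pos: int,
    other_features: Sequence[float],
    flank: int = 5,
) -> Tuple[List[List[int]], List[float]]:
    """Build the two inputs of a hypothetical multi-input model.

    The first element is the encoded sequence context; the second is the
    vector of non-sequence features (hypothetical examples: read depth,
    allele frequency), passed through unchanged.
    """
    context = extract_context(reference, pos, flank)
    return one_hot(context), list(other_features)


if __name__ == "__main__":
    # Hypothetical variant at position 4 with two made-up feature values.
    seq_input, feature_input = build_inputs("ACGTACGTAC", 4, [30.0, 0.02], flank=2)
    print(seq_input)
    print(feature_input)
```

In a multi-input network, each of these two inputs would typically feed its own branch before the branches are merged for the final artifact-versus-true-variant prediction.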
