Publication Date

Spring 2023

Degree Type

Master's Project

Degree Name

Master of Science in Bioinformatics (MSBI)


Department

Computer Science

First Advisor

Wendy Lee

Second Advisor

William Andreopoulos

Third Advisor

Cleber Ouverney


Keywords

Sequencing artifacts, Somatic variants, Next-generation sequencing, Machine learning


Abstract

Rapid advances in next-generation sequencing (NGS) technology continue to make NGS more affordable, which in turn has produced a large influx of sequencing data. While NGS is relatively fast and efficient compared to previous sequencing technologies, there are many steps in the NGS workflow at which sequencing errors can be introduced. Such sequencing errors are known as artifacts, and without careful handling they can be mistaken for true variants. It is especially important to identify artifacts in cancer biopsies, particularly liquid biopsies, a noninvasive method of sample collection. Somatic mutations occur at low frequencies, and a liquid biopsy makes detection harder still when too few cancer cells are collected in the sample. As a result, distinguishing low-frequency mutations from low-frequency artifacts becomes more difficult. In this study, machine learning methods will be used to model sequencing artifacts in NGS cancer data. The Genome in a Bottle (GIAB) genomes and BAMSurgeon will be used to provide "truth sets" for distinguishing true variants from low-frequency sequencing artifacts.