Publication Date

Spring 2022

Degree Type

Master's Project

Degree Name

Master of Science in Bioinformatics (MSBI)


Department

Computer Science

First Advisor

Wendy Lee


Keywords

Modeling sequence artifacts, next generation sequencing.

Abstract


Advancements in Next Generation Sequencing (NGS) have enabled detection of genetic alterations at large scale and high throughput. NGS offers advantages over the established sequencing method, Sanger sequencing, by processing large sections of the genome simultaneously at a lower cost with higher accuracy. However, recent research has shown that sequencing artifacts are introduced at various steps in the NGS workflow. These artifacts are the result of an accumulation of errors from multiple steps, such as library preparation and downstream processes, and can result in variants being identified that are not actually present in the sequenced genome. Therefore, there is a need to accurately distinguish between true variants and sequencing artifacts. This project built a bioinformatics pipeline to process Whole Exome Sequencing (WES) datasets from the Sequence Read Archive (SRA), as well as large-scale machine learning models to identify errors introduced in the genome sequencing process. Results showed that the models had high classification accuracy, ranging from 98% to 100%, as well as precision and recall scores around 99% when positively identifying artifacts. One feature, “Allele Frequency” or ‘AF’, was shown to have strong predictive power: on its own, it accurately classified 99% of the training data. Since ‘AF’ is an important parameter in variant calling software, a further investigation was conducted, which found that values of ‘AF’ around 0.22 could correctly differentiate most artifacts from non-artifacts. Finally, an investigation into the predictive power of the remaining features identified several others capable of differentiating artifacts.
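The single-feature result described above can be illustrated with a minimal sketch of a threshold rule on ‘AF’, scored by precision and recall. This is not the project's model or pipeline. The function names, the toy values, and in particular the assumption that artifacts fall below the cutoff (the abstract does not state the direction) are hypothetical and chosen only for illustration.

```python
from typing import List, Sequence, Tuple


def classify_by_af(
    af_values: Sequence[float],
    threshold: float = 0.22,       # cutoff value mentioned in the abstract
    artifact_below: bool = True,   # ASSUMPTION: direction is not given in the source
) -> List[bool]:
    """Return True for each call flagged as an artifact using AF alone."""
    if artifact_below:
        return [af < threshold for af in af_values]
    return [af >= threshold for af in af_values]


def precision_recall(
    predicted: Sequence[bool], actual: Sequence[bool]
) -> Tuple[float, float]:
    """Precision and recall for the positive (artifact) class."""
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p and a)
    fp = sum(1 for p, a in pairs if p and not a)
    fn = sum(1 for p, a in pairs if not p and a)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Made-up toy values for illustration only; not data from the project.
toy_af = [0.05, 0.10, 0.21, 0.30, 0.48, 0.97]
toy_is_artifact = [True, True, False, False, False, False]

predicted = classify_by_af(toy_af)
precision, recall = precision_recall(predicted, toy_is_artifact)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Keeping the direction of the rule as a parameter makes the assumption explicit: with real labeled calls, both settings can be scored and the cutoff swept over a range of values rather than fixed at 0.22.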