Publication Date
Spring 2023
Degree Type
Master's Project
Degree Name
Master of Science in Bioinformatics (MSBI)
Department
Computer Science
First Advisor
Wendy Lee
Second Advisor
William Andreopoulos
Third Advisor
Cleber Ouverney
Keywords
Sequencing artifacts, Somatic variants, Next-generation sequencing, Machine Learning
Abstract
Rapid advances in next-generation sequencing (NGS) technology continue to make NGS more affordable, which in turn has produced a large influx of sequencing data. While NGS is relatively fast and efficient compared to previous sequencing technologies, the NGS workflow has many steps at which sequencing errors can be introduced. Such sequencing errors are known as artifacts, and without careful handling they can be mistaken for true variants. It is especially important to distinguish artifacts in cancer biopsies, specifically liquid biopsies, a noninvasive method for sample collection. Somatic mutations occur at low frequencies, and a liquid biopsy makes detection harder still when too few cancer cells are collected in the sample. As a result, distinguishing low-frequency mutations from low-frequency artifacts becomes more difficult. In this study, machine learning methods will be used to model sequencing artifacts in NGS cancer data. The Genome in a Bottle (GIAB) genomes and BAMSurgeon will be used as "truth sets" to distinguish true variants from low-frequency sequencing artifacts.
Recommended Citation
Padre, Hannele, "Modeling Sequencing Artifacts in Artificial Low Frequency Cancer Data" (2023). Master's Projects. 1279.
DOI: https://doi.org/10.31979/etd.u9xj-uyt9
https://scholarworks.sjsu.edu/etd_projects/1279