Publication Date

Spring 2022

Degree Type

Master's Project

Degree Name

Master of Science in Bioinformatics (MSBI)

Department

Computer Science

First Advisor

Wendy Lee

Keywords

Modeling sequence artifacts, next generation sequencing.

Abstract

Advancements in Next Generation Sequencing (NGS) have enabled the detection of genetic alterations at large scale and high throughput. NGS offers advantages over the established Sanger sequencing method by processing large sections of the genome simultaneously at a lower cost with higher accuracy. However, recent research has shown that sequencing artifacts are introduced at various steps in the NGS workflow. These artifacts result from an accumulation of errors across multiple steps, such as library preparation and downstream processing, and can cause variants to be identified that are not actually present in the sequenced genome. Therefore, there is a need to accurately distinguish true variants from sequencing artifacts. This project built a bioinformatics pipeline to process Whole Exome Sequencing (WES) datasets from the Sequence Read Archive (SRA), as well as large-scale machine learning models to identify errors introduced in the genome sequencing process. Results showed that the models had high classification accuracy, ranging from 98% to 100%, as well as precision and recall scores around 99% when positively identifying artifacts. One feature, “Allele Frequency” or ‘AF’, was shown to have strong predictive power: on its own, it accurately classified 99% of the training data. Since ‘AF’ is an important parameter in variant calling software, a further investigation was conducted, which found that ‘AF’ values around 0.22 could correctly differentiate most artifacts from non-artifacts. Finally, an investigation into the predictive power of the remaining features identified several others capable of differentiating artifacts.
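
The single-feature result described in the abstract can be illustrated with a minimal sketch. It is not the project's actual model or pipeline. The threshold of 0.22 comes from the abstract, but the direction of the rule (calls below the threshold are treated as artifacts) is an assumption, and the function names and toy values are hypothetical, chosen only to show how a threshold classifier and its precision and recall might be computed.

```python
from typing import List, Tuple

# Threshold reported in the abstract. Treating values BELOW it as artifacts
# is an assumption made for this sketch, not a claim from the project.
AF_THRESHOLD = 0.22


def classify_by_af(af_values: List[float], threshold: float = AF_THRESHOLD) -> List[bool]:
    """Return True (artifact) for each call whose AF is below the threshold."""
    return [af < threshold for af in af_values]


def precision_recall(predicted: List[bool], actual: List[bool]) -> Tuple[float, float]:
    """Compute precision and recall for the positive (artifact) class."""
    tp = sum(p and a for p, a in zip(predicted, actual))
    fp = sum(p and not a for p, a in zip(predicted, actual))
    fn = sum((not p) and a for p, a in zip(predicted, actual))
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical toy data, not taken from the project's datasets.
toy_af = [0.05, 0.10, 0.21, 0.30, 0.48, 0.95]
toy_is_artifact = [True, True, False, False, False, False]

predicted = classify_by_af(toy_af)
precision, recall = precision_recall(predicted, toy_is_artifact)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In practice the AF values would be read from the variant caller's output rather than typed in by hand, and the threshold would be chosen on training data and checked on held-out data.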

Recommended Citation

Leong, Yvonna, "Modeling Sequencing Artifacts for Next Generation Sequencing" (2022). Master's Projects. 1092.
DOI: https://doi.org/10.31979/etd.58q9-t2vx
https://scholarworks.sjsu.edu/etd_projects/1092
