Publication Date
Spring 2023
Degree Type
Master's Project
Degree Name
Master of Science in Bioinformatics (MSBI)
Department
Computer Science
First Advisor
Wendy Lee
Second Advisor
Philip Heller
Third Advisor
William Andreopoulos
Keywords
Next-generation sequencing, Artifacts, Bioinformatics, Machine Learning
Abstract
Next-generation sequencing (NGS) introduces artifactual variants arising from library preparation methods and errors, which affect the accuracy of variant calling. In this project, whole-exome sequencing (WES) data from the National Center for Biotechnology Information (NCBI) Sequence Read Archive (SRA) database were processed. Single nucleotide polymorphism (SNP) calls were compared against Genome in a Bottle (GIAB) to provide the labels used to build machine learning (ML) models. The left and right flanking sequences (LSEQ and RSEQ) of each SNP were extracted. Nucleotide frequency, k-mers of size 4 and their counts, largest homopolymer size, largest palindrome size, and largest hairpin loop size were computed and used as features in model building. The Random Forest model had a precision of 98.8%, a recall of 87.3%, and an accuracy of 90.2%. These scores indicate that the model reliably distinguishes artifacts from non-artifacts.
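The flanking-sequence features named in the abstract can be illustrated with a minimal Python sketch. This is not the project's code: the function names, the `LSEQ_`/`RSEQ_` key prefixes, and the definition of a palindrome as a substring equal to its own reverse complement (with an assumed minimum length of 4) are all assumptions made for illustration. The hairpin loop feature is left out because the abstract does not define it.

```python
from collections import Counter

# Complement table for the four DNA bases (other symbols pass through unchanged).
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def nucleotide_frequency(seq):
    """Fraction of each base (A, C, G, T) in seq."""
    seq = seq.upper()
    n = len(seq)
    return {base: (seq.count(base) / n if n else 0.0) for base in "ACGT"}


def kmer_counts(seq, k=4):
    """Counts of every overlapping k-mer in seq."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def longest_homopolymer(seq):
    """Length of the longest run of one repeated base."""
    best = run = 0
    prev = None
    for base in seq.upper():
        run = run + 1 if base == prev else 1
        best = max(best, run)
        prev = base
    return best


def reverse_complement(seq):
    """Reverse complement of a DNA sequence."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def longest_palindrome(seq, min_len=4):
    """Length of the longest substring equal to its own reverse complement.

    Assumed definition; min_len is a hypothetical cutoff. The brute-force
    search is adequate for short flanking sequences only.
    """
    seq = seq.upper()
    n = len(seq)
    for length in range(n, min_len - 1, -1):
        for start in range(n - length + 1):
            sub = seq[start:start + length]
            if sub == reverse_complement(sub):
                return length
    return 0


def flank_features(lseq, rseq, k=4):
    """Flat feature dictionary for one SNP from its two flanking sequences."""
    features = {}
    for name, seq in (("LSEQ", lseq), ("RSEQ", rseq)):
        for base, freq in nucleotide_frequency(seq).items():
            features[f"{name}_freq_{base}"] = freq
        for kmer, count in kmer_counts(seq, k).items():
            features[f"{name}_kmer_{kmer}"] = count
        features[f"{name}_homopolymer"] = longest_homopolymer(seq)
        features[f"{name}_palindrome"] = longest_palindrome(seq)
    return features
```

Calling `flank_features(lseq, rseq)` on each SNP yields one dictionary per variant. These dictionaries could then be assembled into a feature table, with absent k-mers filled as zero, before training a classifier.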
Recommended Citation
Lam, Kathy Thanh, "Characterizing Sequencing Artifacts" (2023). Master's Projects. 1282.
DOI: https://doi.org/10.31979/etd.4wdn-prkq
https://scholarworks.sjsu.edu/etd_projects/1282