Publication Date
Fall 2024
Degree Type
Master's Project
Degree Name
Master of Science in Bioinformatics (MSBI)
Department
Computer Science
First Advisor
Dr. Wendy Lee
Second Advisor
Dr. William Andreopoulos
Third Advisor
Dr. Teng Moh
Keywords
Low-pass whole genome sequencing, Supervised Learning, Sequencing Artifacts, AdaBoost, Strand Bias
Abstract
Low-pass whole genome sequencing (LP-WGS) provides a cost-effective way to achieve broad genomic coverage, but it comes with the challenge of sequencing artifacts that can complicate accurate variant detection. To address this, we developed a bioinformatics pipeline using Nextflow. Starting with raw sequencing data, the pipeline performed variant calling using VarDict, with Genome in a Bottle (GIAB) high-confidence variants serving as the benchmark for variant validation. We explored machine learning approaches, testing classifiers such as AdaBoost, ExtraTrees, and RandomForest, to evaluate variant classification. Twenty-two features generated by VarDict were fed into the machine learning pipeline, with AdaBoost standing out for its balance of precision and recall. Features such as Strand Bias Odds Ratio, Fisher p-value, and Allele Frequency emerged as key contributors to accurate classification. This study highlights the potential of combining LP-WGS with machine learning to improve variant detection despite sequencing limitations.
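To make the three highlighted features concrete, the following is a minimal sketch of how a strand bias odds ratio, a two-sided Fisher exact p-value, and an allele frequency could be computed from per-strand read counts at a single site. It follows the textbook definitions and is not taken from VarDict or from the project's pipeline; the function name `strand_bias_features`, its arguments, and the simplification that depth equals reference plus alternate reads are all assumptions made for illustration.

```python
from math import comb


def strand_bias_features(ref_fwd, ref_rev, alt_fwd, alt_rev):
    """Return (odds_ratio, fisher_p, allele_frequency) for one site.

    Hypothetical helper: inputs are read counts supporting the reference
    and alternate alleles on the forward and reverse strands.
    """
    a, b, c, d = ref_fwd, ref_rev, alt_fwd, alt_rev
    n = a + b + c + d
    if n == 0:
        raise ValueError("no reads at this site")

    # Odds ratio of the 2x2 table [[a, b], [c, d]]; undefined cases are
    # reported as inf or nan rather than raising.
    num, den = a * d, b * c
    if den == 0:
        odds = float("inf") if num > 0 else float("nan")
    else:
        odds = num / den

    # Two-sided Fisher exact test: sum the hypergeometric probabilities of
    # all tables with the same margins that are no more likely than the
    # observed one.
    row1, col1 = a + b, a + c
    total = comb(n, col1)

    def prob(x):
        return comb(row1, x) * comb(n - row1, col1 - x) / total

    lo, hi = max(0, col1 - (n - row1)), min(row1, col1)
    p_obs = prob(a)
    p = sum(prob(x) for x in range(lo, hi + 1)
            if prob(x) <= p_obs * (1 + 1e-9))

    # Assumption: depth is approximated as ref + alt reads.
    allele_freq = (c + d) / n
    return odds, min(p, 1.0), allele_freq


balanced = strand_bias_features(10, 10, 10, 10)  # alt reads on both strands
skewed = strand_bias_features(20, 20, 10, 0)     # alt reads on one strand only
```

In this sketch a call supported evenly on both strands yields an odds ratio of 1 and a p-value near 1, while alternate reads confined to one strand yield a small p-value, which is the kind of signal a classifier could use to separate artifacts from true variants.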
Recommended Citation
Do, Nguyen Mai Anh, "Artifacts in Low-Pass Whole Genome Sequencing" (2024). Master's Projects. 1439.
https://scholarworks.sjsu.edu/etd_projects/1439