Publication Date

Fall 2024

Degree Type

Master's Project

Degree Name

Master of Science in Bioinformatics (MSBI)

Department

Computer Science

First Advisor

Dr. Wendy Lee

Second Advisor

Dr. William Andreopoulos

Third Advisor

Dr. Teng Moh

Keywords

Low-pass whole genome sequencing, Supervised Learning, Sequencing Artifacts, AdaBoost, Strand Bias

Abstract

Low-pass whole genome sequencing (LP-WGS) provides a cost-effective way to achieve broad genomic coverage, but it comes with the challenge of sequencing artifacts that can complicate accurate variant detection. To address this, we developed a bioinformatics pipeline using Nextflow. Starting with raw sequencing data, the pipeline performed variant calling using VarDict, with Genome in a Bottle (GIAB) high-confidence variants serving as the benchmark for variant validation. We explored machine learning approaches to variant classification, testing classifiers such as AdaBoost, ExtraTrees, and RandomForest. Twenty-two features generated by VarDict were fed into the machine learning pipeline, and AdaBoost stood out for its balance of precision and recall. Features such as Strand Bias Odds Ratio, Fisher p-value, and Allele Frequency emerged as key contributors to accurate classification. This study highlights the potential of combining LP-WGS with machine learning to improve variant detection despite sequencing limitations.
