Publication Date

Fall 2020

Degree Type

Master's Project

Degree Name

Master of Science (MS)


Department

Computer Science

First Advisor

Wendy Lee

Second Advisor

Fabio Di Troia

Third Advisor

William Andreopoulos


Keywords

Next generation sequencing workflow, meta-data, sequencing artifacts


Abstract

Next generation sequencing (NGS) has revolutionized the biological sciences. Today, entire genomes can be rapidly sequenced, enabling advances in personalized medicine, the study of genetic diseases, and more. The National Center for Biotechnology Information (NCBI) hosts the Sequence Read Archive (SRA), which contains vast amounts of valuable NGS data. Recent research has shown that sequencing errors in conventional NGS workflows are key confounding factors in detecting mutations. Steps such as sample handling and library preparation can introduce artifacts that affect the accuracy of calling rare mutations. Thus, there is a need for more insight into the exact relationship between the steps of the NGS workflow (the metadata) and sequencing artifacts. This paper presents a new tool called SRAMetadataX that enables researchers to easily extract crucial metadata from SRA submissions. The tool was used to identify eight sequencing runs that used hybrid capture or PCR for enrichment. A bioinformatics pipeline was built that identified 298,936 potential sequencing artifacts from these runs. Various machine learning models were trained on the data, and the results showed that the models could predict the enrichment method with about 70% accuracy, indicating that different enrichment methods likely produce specific sequencing artifacts.