Publication Date
Spring 2019
Degree Type
Master's Project
Degree Name
Master of Science (MS)
Department
Computer Science
First Advisor
Philip Heller
Second Advisor
Sami Khuri
Third Advisor
Wendy Lee
Keywords
Classification, cytochrome c oxidase subunit 1, DNA barcoding, genetic identification, profile hidden Markov models, taxonomy
Abstract
Genetic identification aims to overcome the shortcomings of morphological identification. By using the cytochrome c oxidase subunit 1 (COI) gene as the eukaryotic “barcode,” scientists hope to study species that are morphologically ambiguous, elusive, or otherwise difficult to identify visually. Current COI databases allow users to search only for existing database records. However, as the number of sequenced, potential COI genes increases, COI identification tools should ideally also be informative about novel, previously unreported sequences that may represent new species. If an unknown COI sequence does not represent a reported organism, an ideal identification tool would report the taxonomic ranks to which the sequence is likely to belong. A potential solution is to dynamically create profile hidden Markov models (PHMMs): first at the genus level, then at the family level, and so on through higher taxonomic ranks until a significant score is found. This study experiments with creating PHMMs at the genus level, determining thresholds for classification, and assessing both the general performance of this method and the requirements for future expansion to higher taxonomic groups. It ultimately determines that this approach shows potential, but may require additional data pre-processing and may be constrained by current machine limitations.
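The rank-by-rank search described in the abstract can be illustrated with a minimal Python sketch. It is not the project's implementation: the function `classify`, the `score_fn` callback (a stand-in for building a PHMM for a taxon on demand and scoring the query against it), the candidate lists, the thresholds, and all scores in the toy data are hypothetical and chosen only for illustration.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Ranks are tried from most to least specific (illustrative assumption).
RANKS: List[str] = ["genus", "family", "order", "class", "phylum"]


def classify(
    sequence: str,
    candidates: Dict[str, List[str]],
    score_fn: Callable[[str, str, str], float],
    thresholds: Dict[str, float],
) -> Optional[Tuple[str, str, float]]:
    """Return (rank, taxon, score) for the lowest rank with a significant score.

    score_fn(sequence, rank, taxon) is a hypothetical placeholder for
    "build a PHMM for this taxon and score the sequence against it".
    Returns None if no rank produces a score at or above its threshold.
    """
    for rank in RANKS:
        taxa = candidates.get(rank, [])
        if not taxa:
            continue  # no candidate taxa supplied for this rank
        # Score the query against every candidate taxon and keep the best.
        best_score, best_taxon = max(
            (score_fn(sequence, rank, taxon), taxon) for taxon in taxa
        )
        if best_score >= thresholds[rank]:
            return rank, best_taxon, best_score
    return None


# Toy data: the scores below are invented, not measured results.
toy_scores = {
    ("genus", "Aedes"): 12.0,
    ("genus", "Culex"): 8.5,
    ("family", "Culicidae"): 41.0,
}


def toy_score(sequence: str, rank: str, taxon: str) -> float:
    return toy_scores.get((rank, taxon), 0.0)


candidates = {"genus": ["Aedes", "Culex"], "family": ["Culicidae"]}
thresholds = {"genus": 30.0, "family": 30.0}
result = classify("ACGT", candidates, toy_score, thresholds)
print(result)
```

In this toy run no genus-level score reaches its threshold, so the search moves up to the family level and stops there. Passing the scorer as a callback keeps the traversal logic separate from whichever PHMM tool performs the actual model building and scoring.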
Recommended Citation
Sheu, Jessica, "Toward On-demand Profile Hidden Markov Models for Genetic Barcode Identification" (2019). Master's Projects. 671.
DOI: https://doi.org/10.31979/etd.qg3k-5ufh
https://scholarworks.sjsu.edu/etd_projects/671