Publication Date

Fall 12-19-2015

Degree Type

Master's Project

Degree Name

Master of Science (MS)


Department

Computer Science

First Advisor

T. Y. Lin

Second Advisor

Chris Tseng

Third Advisor

Howard Ho


We consider the problem of identifying similarities between the DNA of different species. To do this, we infer a stochastic finite automaton (SFA) from given training data and compare it with test data. Both the training and test data consist of DNA sequences from different species. Our method first identifies sentences in the DNA. To identify sentences, we read the DNA sequence one character at a time: three characters form a codon, and codons form proteins (also known as amino acid chains). Each amino acid in a protein belongs to a group. In total we have five groups: polar, non-polar, acidic, basic, and stop codons. A protein always starts with the start codon ATG, which belongs to the polar group, and ends with one of the stop codons, which belong to the stop codon group. After identifying sentences, our method converts them into a symbolic representation: strings of numbers in which each number represents the group to which an amino acid belongs. We then generate a prefix tree acceptor (PTA) and merge equivalent states to produce a stochastic finite automaton for the DNA.
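As a rough illustration of the steps described above, the following Python sketch extracts sentences from a DNA string, encodes each amino acid by its group number, and builds a frequency-annotated prefix tree. It is a minimal sketch under stated assumptions, not the project's implementation. The function names (`extract_sentences`, `build_pta`, `transition_probability`) and the numeric group codes 1 to 5 are hypothetical. Placing ATG (methionine) in the polar group follows the abstract; the remaining group assignments are a common textbook grouping assumed here. The state-merging step that turns the tree into an SFA is omitted.

```python
from itertools import product

# Standard genetic code, listed in T, C, A, G order for each codon position.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TO_AMINO = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}

# Hypothetical group codes: 1 polar, 2 non-polar, 3 acidic, 4 basic, 5 stop.
# M (the ATG start codon) is placed in the polar group, as the abstract states.
GROUPS = {1: "STCYNQM", 2: "GAVLIFWP", 3: "DE", 4: "KRH", 5: "*"}
AMINO_TO_GROUP = {a: g for g, letters in GROUPS.items() for a in letters}


def extract_sentences(dna):
    """Return one list of group codes per ATG...stop stretch found in dna."""
    dna = dna.upper()
    found, i = [], 0
    while i + 3 <= len(dna):
        if dna[i:i + 3] != "ATG":
            i += 1          # keep scanning one character at a time
            continue
        codes, j, closed = [], i, False
        while j + 3 <= len(dna) and not closed:
            amino = CODON_TO_AMINO.get(dna[j:j + 3])
            if amino is None:   # unknown base such as N: abandon this stretch
                break
            codes.append(AMINO_TO_GROUP[amino])
            closed = amino == "*"
            j += 3
        if closed:
            found.append(codes)
            i = j           # resume scanning after the stop codon
        else:
            i += 1
    return found


def build_pta(sentences):
    """Build a prefix tree with visit and termination counts per state."""
    transitions = [{}]  # transitions[state][symbol] -> next state
    visits = [0]        # number of sentences passing through each state
    finals = [0]        # number of sentences ending in each state
    for sentence in sentences:
        state = 0
        visits[0] += 1
        for symbol in sentence:
            nxt = transitions[state].get(symbol)
            if nxt is None:
                nxt = len(transitions)
                transitions[state][symbol] = nxt
                transitions.append({})
                visits.append(0)
                finals.append(0)
            state = nxt
            visits[state] += 1
        finals[state] += 1
    return transitions, visits, finals


def transition_probability(pta, state, symbol):
    """Relative frequency of taking `symbol` out of `state`."""
    transitions, visits, _ = pta
    nxt = transitions[state].get(symbol)
    return 0.0 if nxt is None else visits[nxt] / visits[state]
```

Keeping counts on every state means transition probabilities can be read off directly, and a merging procedure could later compare those frequencies to decide which states are equivalent.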

In addition to producing the SFA, we use secondary storage to handle very large DNA sequences. We also explain the concepts needed to understand this paper.