Publication Date
Spring 2015
Degree Type
Master's Project
Degree Name
Master of Science (MS)
Department
Computer Science
First Advisor
Sami Khuri
Second Advisor
Thomas Austin
Third Advisor
Chris Tseng
Keywords
HMM, DNA Motifs, Bioinformatics
Abstract
During gene expression in eukaryotes, mRNA splicing is one of the key processes and is carried out by a complex called the spliceosome. The spliceosome ensures proper removal of introns and joining of exons before translation. Precise splicing is essential for the production of functional proteins. The spliceosome detects specific sequence motifs, called splice sites, within an mRNA sequence. Two of these are the 5’ and 3’ splice sites that border every intron. If the normal splicing process is disrupted by mutation, fatal diseases may result. In this work, we predict splice sites in the human genome using hidden Markov models (HMMs).
Before turning to hidden Markov models, we tried to predict splice sites using higher order position weight matrices. A Position Weight Matrix (PWM) is a conventional computational method used to represent splice sites or any other sequence motif. Given a set of aligned sequences, a PWM captures the distribution of nucleotides at each position. Simple PWMs perform reasonably well in classifying authentic 5’ and 3’ splice sites and predicting cryptic splice sites in human genes [1, 2, 3]. However, they are built on a strong assumption of independence between nucleotide positions, whether contiguous or non-contiguous. Therefore, we developed a higher order PWM method that incorporates the maximal dependence decomposition (MDD) algorithm [4] to successfully identify statistically significant splice sites.
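To make the PWM idea concrete, the following is a minimal sketch of a simple (zeroth-order) PWM, not the implementation used in this project. It assumes a uniform background frequency of 0.25, a pseudocount of 1, and log2-odds scoring; the function names `build_pwm` and `score` and the toy sequences are hypothetical and chosen only for illustration.

```python
import math

NUCLEOTIDES = "ACGT"


def build_pwm(sequences, pseudocount=1.0, background=0.25):
    """Build a log2-odds PWM from equal-length aligned sequences.

    Each position is modeled independently of every other position,
    which is exactly the independence assumption of a simple PWM.
    """
    length = len(sequences[0])
    pwm = []
    for pos in range(length):
        # Start from the pseudocount so unseen nucleotides get a finite score.
        counts = {nt: pseudocount for nt in NUCLEOTIDES}
        for seq in sequences:
            counts[seq[pos]] += 1
        total = sum(counts.values())
        pwm.append(
            {nt: math.log2((counts[nt] / total) / background) for nt in NUCLEOTIDES}
        )
    return pwm


def score(pwm, candidate):
    """Sum the per-position log-odds scores of a candidate site."""
    return sum(column[nt] for column, nt in zip(pwm, candidate))


# Toy aligned sequences (invented for illustration, not taken from HS3D).
toy_sites = ["CAGGTAAGT", "AAGGTGAGT", "CAGGTAAGA", "GAGGTAAGT"]
pwm = build_pwm(toy_sites)
consensus_score = score(pwm, "CAGGTAAGT")
unrelated_score = score(pwm, "TTTCCCAAA")
```

A candidate resembling the training sequences receives a higher score than an unrelated one. Because the score is a sum over independent columns, dependencies between positions are invisible to this model, which is the limitation that MDD is meant to address.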
A simple PWM also fails to capture sites that lie in both splice site and non-splice site regions. Therefore, we implemented HMMs to overcome this limitation of the PWM.
We performed 10-fold cross validation of all three methods on authentic 5’ and 3’ human splice sites from the HS3D database [5] and observed that MDD outperforms the other two methods, with areas under the Receiver Operating Characteristic (ROC) curve of 0.96 and 0.93, respectively. Similarly, we classified putative 5’ and 3’ cryptic splice sites in the beta-globin (HBB) and breast cancer type 1 susceptibility protein (BRCA1) genes. We observed that MDD performs very well in classifying both BRCA1 and HBB cryptic splice sites, with areas under the ROC curve of 0.99, 0.95, 0.89 and 1.0, respectively. However, we also observed that HMMs perform fairly well in classifying splice sites and cryptic splice sites compared to the traditional PWM method.
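As an illustration of the evaluation metric, the sketch below computes the area under the ROC curve from classifier scores using the rank-based (Mann-Whitney) formulation. The abstract does not say how the AUC values were computed, so this is only one common way to obtain the quantity; the function name `roc_auc` and the example scores are hypothetical.

```python
def roc_auc(positive_scores, negative_scores):
    """Area under the ROC curve via pairwise comparison.

    The AUC equals the probability that a randomly chosen true site
    scores higher than a randomly chosen false site, with ties
    counted as half a win.
    """
    wins = 0.0
    for p in positive_scores:
        for n in negative_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))


# Invented scores for illustration only.
true_site_scores = [0.9, 0.4]
false_site_scores = [0.5, 0.1]
auc = roc_auc(true_site_scores, false_site_scores)
```

An AUC of 1.0 means every true site outscores every false site, while 0.5 corresponds to random ranking.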
Recommended Citation
Nerli, Santrupti, "Using Hidden Markov Models to Detect DNA Motifs" (2015). Master's Projects. 388.
DOI: https://doi.org/10.31979/etd.qne6-rbsj
https://scholarworks.sjsu.edu/etd_projects/388