Publication Date

Spring 5-22-2017

Degree Type

Master's Project

Degree Name

Master of Science (MS)


Computer Science

First Advisor

Sami Khuri

Second Advisor

Philip Heller

Third Advisor

Robert Chun


In eukaryotes, DNA in the nucleus is transcribed into mRNA, which is then translated into proteins. The transcribed DNA is composed of coding and non-coding regions called exons and introns, respectively. The transcript undergoes a post-transcriptional process called splicing, in which the introns (the non-coding regions) are removed from the pre-mRNA to give the mature mRNA. Splicing of pre-mRNAs at the 5′ and 3′ splice sites is a crucial step in the gene expression pathway. Mutations can cause the spliceosome to mis-splice at alternative sites known as cryptic splice sites, which alters the mRNA product and, eventually, the protein that is produced. This can lead to devastating genetic diseases.

Consequently, it is important to understand why a mutation causes mis-splicing and why particular cryptic splice sites are chosen in place of the authentic ones. This work addresses that central question: why are known cryptic splice sites selected over authentic splice sites, and can we detect and predict putative cryptic splice sites in the human genome?

This project uses two probabilistic models, position weight matrices and hidden Markov models, to answer this question. A position weight matrix is a widely used computational method in bioinformatics for representing motifs in biological sequences. A hidden Markov model is a statistical model of a system that has several unobserved, or hidden, states, and it is an effective way to represent a probability distribution over observable sequences. We used the Baum-Welch algorithm to train the model, and then used the Forward algorithm to compute the likelihood of an observed sequence under the trained model.
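
To make the two models concrete, the following is a minimal Python sketch, not the project's actual implementation. It builds a log-odds position weight matrix from a few aligned sites and runs the Forward algorithm on a small two-state hidden Markov model. The example sites, the state names, every probability value, the pseudocount, and the uniform background are illustrative assumptions; in the project the model parameters would come from Baum-Welch training rather than being set by hand.

```python
import math

ALPHABET = "ACGT"


def build_pwm(aligned_sites, pseudocount=1.0, background=0.25):
    """Return one {base: log2-odds} dict per motif position.

    Assumes equal-length sites and a uniform background (an assumption).
    """
    pwm = []
    for pos in range(len(aligned_sites[0])):
        counts = {b: pseudocount for b in ALPHABET}
        for site in aligned_sites:
            counts[site[pos]] += 1
        total = sum(counts.values())
        pwm.append(
            {b: math.log2(counts[b] / total / background) for b in ALPHABET}
        )
    return pwm


def score_pwm(pwm, window):
    """Sum the per-position log-odds scores of a window of motif length."""
    return sum(column[base] for column, base in zip(pwm, window))


def forward_likelihood(obs, states, start, trans, emit):
    """Forward algorithm: P(obs | model), summed over all hidden paths."""
    # Initialisation: probability of starting in each state and emitting obs[0].
    alpha = {s: start[s] * emit[s][obs[0]] for s in states}
    # Recursion: extend every partial path by one observed symbol.
    for symbol in obs[1:]:
        alpha = {
            s: emit[s][symbol] * sum(alpha[r] * trans[r][s] for r in states)
            for s in states
        }
    # Termination: sum over the state occupied at the final position.
    return sum(alpha.values())


# Hypothetical toy data: three made-up donor-like sites.
toy_sites = ["GTAAGT", "GTGAGT", "GTAAGA"]
toy_pwm = build_pwm(toy_sites)

# Hypothetical hand-set HMM parameters (not trained values).
toy_states = ("exon", "intron")
toy_start = {"exon": 0.5, "intron": 0.5}
toy_trans = {
    "exon": {"exon": 0.9, "intron": 0.1},
    "intron": {"exon": 0.2, "intron": 0.8},
}
toy_emit = {
    "exon": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
    "intron": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
}

if __name__ == "__main__":
    print(score_pwm(toy_pwm, "GTAAGT"))
    print(forward_likelihood("GTAAGT", toy_states, toy_start, toy_trans, toy_emit))
```

A window that matches the consensus of the training sites receives a higher position weight matrix score than an unrelated window, and the Forward likelihood can be compared across candidate sequences or across models. For long sequences, a practical version would work in log space or rescale the forward variables to avoid numerical underflow.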