Publication Date

Spring 2017

Degree Type

Master's Project

Degree Name

Master of Science (MS)

Department

Computer Science

First Advisor

Sami Khuri

Second Advisor

Philip Heller

Third Advisor

Katerina Potika

Keywords

Hidden Markov Model, cryptic splice sites

Abstract

Splicing is the editing of the precursor mRNA produced during transcription. The precursor mRNA consists of introns and exons; splicing removes the introns and joins the exons to produce the mature mRNA, which is translated into protein. Accurate splicing at the 5’ and 3’ splice sites, the authentic splice sites (AuthSS), is therefore of foremost importance. The 5’ and 3’ splice sites are characterized by consensus sequences. Eukaryotic genomes also contain Cryptic Splice Sites (CSS), which match the consensus but are activated only when there is a mutation in the gene. Many diseases, such as β-thalassemia, cancer, and epilepsy, are caused by the activation of CSS, exon skipping, or the alteration of alternative splicing [19].

The purpose of this writing project is to design, implement, and evaluate two classifiers, and to perform experiments that help explain the mechanics behind the spliceosome’s selection of CSS. The classifiers are Hidden Markov Models (HMM), a type of stochastic signal model [11][18], and a One-Class Classification (OCC) [20][7] decision tree, a non-parametric supervised learning method that uses Information Gain (IG) as its decision metric [5][23].

For evaluation, we constructed four datasets. The first dataset consisted of authentic 5’ splice sites and the second of random sites from HS3D [16]. The other two datasets were constructed from DBASS [4]: one consists of cryptic 5’ splice sites and the other of neighboring sites. We built two decision trees and two HMMs, one of each from the authentic 5’ splice site (AuthSS) dataset and one of each from the cryptic splice site (CSS) dataset.

We scored the AuthSS and CSS on the AuthSS HMM and obtained AUCs of 0.88 and 0.86, respectively. We then scored the CSS and AuthSS on the CSS HMM and obtained AUCs of 0.87 and 0.86, respectively. We ran similar experiments with the decision trees. Scoring the AuthSS and CSS on the AuthSS decision tree gave accuracies of 0.83 and 0.78, respectively. Scoring the CSS and AuthSS on the CSS decision tree gave accuracies of 0.81 and 0.71, respectively. Thus, we observed that the AuthSS and CSS are intrinsically different, and we experimented further to understand why the spliceosome chose the CSS over other available ‘GT’ sites. We separately scored the neighboring sites dataset on the AuthSS and CSS decision trees and obtained accuracies of 0.52 and 0.55, respectively. We also scored the neighboring sites dataset on the AuthSS and CSS HMMs and obtained AUCs of 0.53 and 0.58, respectively. Thus, the CSS scored higher than the neighboring sites on both the HMMs and the decision trees.
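The AUC values above summarize how well a model's scores separate one set of sites from another. The following is a minimal sketch, not the project's code: it assumes each site has already been assigned a numeric score by some model and is being compared against a background set. The function name `auc` and the example scores are hypothetical illustrations, not data from this project. Under those assumptions, the AUC is the fraction of (site, background) pairs in which the site receives the higher score, with ties counted as half:

```python
def auc(pos_scores, neg_scores):
    """Rank-based AUC: probability that a randomly chosen positive
    scores higher than a randomly chosen negative (ties count 0.5)."""
    if not pos_scores or not neg_scores:
        raise ValueError("both score lists must be non-empty")
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Hypothetical scores for illustration only (not results from this project).
site_scores = [0.9, 0.8, 0.4]        # e.g. model scores for candidate splice sites
background_scores = [0.7, 0.3, 0.2]  # e.g. model scores for background sites

print(round(auc(site_scores, background_scores), 3))
```

An AUC near 0.5 means the scores do not separate the two sets, while an AUC near 1.0 means almost every site outscores almost every background sequence.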

Finally, we compared the decision trees to measure the degree of similarity between them. We found that the AuthSS and CSS decision trees are 29% similar, whereas the AuthSS decision tree and the decision tree built from the neighboring sites are 16% similar. We conclude that even though the AuthSS and CSS are intrinsically different, the CSS still match the consensus sequence better than other available ‘GT’ sites. Hence, the spliceosome splices at the CSS when there is a mutation.