Publication Date

Spring 2017

Degree Type

Master's Project

Degree Name

Master of Science (MS)

Department


Computer Science

First Advisor

Sami Khuri

Second Advisor

Thomas Austin

Third Advisor

Philip Heller

Keywords


Cryptic Splice Sites, Genetic Algorithms, Random Forest based classifier

Abstract


Proteins are the building blocks of eukaryotic organisms, and the synthesis of proteins from DNA is crucial to an organism's health [13]. However, some mutations in the DNA disrupt the spliceosome's selection of 5’ or 3’ splice sites. An important research question is whether these disruptions have a stochastic relation to the positions of nucleotides in the vicinity of known authentic and cryptic splice sites. The question can be answered by showing that authentic and cryptic splice sites are intrinsically different. However, the behavior of the spliceosome is not accurately known. Hence, a logical step is to model its behavior with an algorithm suited to modeling unknown functions.

Genetic Algorithms have played an important role in heuristically optimizing NP-Hard search problems [8]. Exhaustively searching the splice site data space to determine the spliceosome function is an NP-Hard problem. Thus, the spliceosome function is modeled as a search problem, and a Genetic Algorithm based framework is created to test the hypothesis that authentic and cryptic splice sites are intrinsically different.

A Random Forest based classifier is proposed as the scoring function. It makes the mechanism used to compare authentic and cryptic splice sites less rigid.
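
The general shape of such a framework, a genetic algorithm driven by a pluggable scoring function, can be illustrated with a minimal Python sketch. Everything specific in it is an assumption made for illustration and is not taken from the project: the bit-mask encoding of nucleotide positions, the toy sequences (made up, not real splice site data), the names `evolve` and `score`, and above all the scoring function, which is a simple stand-in for the Random Forest based classifier rather than an implementation of it.

```python
import random

# Made-up 9-nucleotide windows for illustration only; NOT real splice site data.
AUTHENTIC = ["CAGGTAAGT", "AAGGTGAGT", "CAGGTAAGA", "TAGGTAAGT"]
CRYPTIC = ["CTGGTACGC", "AAGGTTCGA", "GCGGTCAGC", "TTGGTAGCT"]
WINDOW = 9


def score(mask):
    """Stand-in scoring function (hypothetical, not a Random Forest).

    Returns the fraction of mismatches between authentic and cryptic
    sequences at the positions selected by the bit mask.
    """
    selected = [i for i, bit in enumerate(mask) if bit]
    if not selected:
        return 0.0
    pairs = [(a, c) for a in AUTHENTIC for c in CRYPTIC]
    mismatches = sum(a[i] != c[i] for a, c in pairs for i in selected)
    return mismatches / (len(pairs) * len(selected))


def evolve(score_fn, length, pop_size=20, generations=30,
           mutation_rate=0.1, seed=0):
    """Evolve a bit mask of the given length that maximizes score_fn."""
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(length)]
                  for _ in range(pop_size)]
    best = max(population, key=score_fn)

    def pick():
        # Binary tournament selection: the fitter of two random individuals.
        a, b = rng.sample(population, 2)
        return a if score_fn(a) >= score_fn(b) else b

    for _ in range(generations):
        next_population = [best[:]]  # elitism: carry the best mask forward
        while len(next_population) < pop_size:
            parent1, parent2 = pick(), pick()
            cut = rng.randrange(1, length)  # single-point crossover
            child = parent1[:cut] + parent2[cut:]
            # Flip each bit independently with probability mutation_rate.
            child = [1 - g if rng.random() < mutation_rate else g
                     for g in child]
            next_population.append(child)
        population = next_population
        best = max(population, key=score_fn)
    return best, score_fn(best)


if __name__ == "__main__":
    best_mask, best_score = evolve(score, WINDOW)
    print("selected positions:", [i for i, bit in enumerate(best_mask) if bit])
    print("score:", round(best_score, 3))
```

Because `evolve` only calls `score_fn`, the stand-in could be swapped for a function backed by a trained classifier without changing the search loop, which is the kind of flexibility the abstract attributes to a classifier-based scoring function.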