Publication Date

Fall 2010

Degree Type

Master's Project

Degree Name

Master of Science (MS)


Department

Computer Science

First Advisor

Sami Khuri

Second Advisor

Robert Chun

Third Advisor

Xiuduan Fang


Keywords

profile Hidden Markov Model, microRNA

Abstract


MicroRNAs (miRNAs) are a class of small RNA molecules, about 22 nucleotides long, that regulate gene expression at the post-transcriptional level. The discovery of the second miRNA 10 years ago was as much a surprise in its own way as the discovery of the structure of DNA a half century earlier[1]. How could these small molecules regulate so many genes? During the past decade, the complex cascade of regulation has been investigated and reported in detail[2]. The regions of the genome called untranslated regions, or UTRs, proved true to their name in that they are indeed untranslated, but they are far from unimportant: they act as the origin and often the destination of miRNAs.

miRBase[3] contains 1048 human miRNAs, with more undoubtedly on the way. Experimental identification of miRNA targets, however, has proven slow and difficult, so scientists have turned to computational target prediction programs to identify potential miRNA targets quickly. Current prediction tools have produced a huge number of potential target sites, but whether those predictions are correct, and which algorithms produce the most reliable ones, remain open questions.

This project examines one type of algorithm, a probabilistic model called a profile Hidden Markov Model (pHMM), and uses it to predict miRNA target sites. HMMs are known to be very effective in pattern recognition and have been successfully applied to various bioinformatics problems, such as gene finding, multiple sequence alignment, and protein family classification[4]. We proposed to build a pHMM from known miRNA interactions and to use this model to identify potential miRNA target sites in UTRs by abstracting the Watson-Crick base pairs into meta codes intended to describe important relationships in RNA folding more naturally. High-quality positive training data came from the best curated mRNA:miRNA databases we could find, while negative training data was generated from random sequences. The purpose of this project was to demonstrate the flexibility of the pHMM architecture in processing many kinds of data and, by doing so, to improve miRNA target site prediction.
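
The idea of abstracting base pairs into meta codes can be illustrated with a minimal Python sketch. The abstract does not specify the project's actual code alphabet, so the one used here is purely an assumption for illustration: `W` for a Watson-Crick pair, `G` for a G:U wobble pair, `M` for a mismatch, and `-` for an alignment gap. The function name `encode_duplex` and the example sequences are likewise hypothetical.

```python
# Hypothetical meta-code alphabet; the project's real codes are not given
# in the abstract, so these symbols are assumptions for illustration only.
WATSON_CRICK = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}
WOBBLE = {("G", "U"), ("U", "G")}


def encode_duplex(mirna: str, target: str) -> str:
    """Encode an aligned miRNA:target duplex as a string of meta codes.

    `mirna` is assumed to be written 5'->3' and `target` 3'->5', so that
    index i of one string faces index i of the other. Both strings must
    have equal length and may contain '-' for alignment gaps.
    """
    if len(mirna) != len(target):
        raise ValueError("aligned sequences must have equal length")
    codes = []
    for m, t in zip(mirna.upper(), target.upper()):
        if m == "-" or t == "-":
            codes.append("-")  # gap in either strand
        elif (m, t) in WATSON_CRICK:
            codes.append("W")  # canonical Watson-Crick pair
        elif (m, t) in WOBBLE:
            codes.append("G")  # G:U wobble pair
        else:
            codes.append("M")  # mismatch
    return "".join(codes)


if __name__ == "__main__":
    # Eight facing positions of a made-up duplex.
    print(encode_duplex("UGAGGUAG", "ACUCCGUC"))
```

Under these assumptions, each known interaction becomes one string over a four-symbol alphabet, and a set of such strings could serve as the training sequences from which a profile HMM estimates its emission and transition probabilities.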