Publication Date


Degree Type

Master's Project

Degree Name

Master of Science (MS)


Computer Science


Virus writers and anti-virus researchers generally agree that metamorphism is the way to generate undetectable viruses. Several virus writers have released virus creation kits and claimed that these kits can automatically produce morphed virus variants that look substantially different from one another. To see how effective these code morphing engines are, and how much difference exists between variants of the same virus, we measured the similarity between virus variants generated by four virus generators downloaded from the Internet. Our results show that the effectiveness of these generators varies widely. While the best generator, NGVCK, is able to create viruses that are only a few percent similar to one another, the other generators produce viruses that are over 60% similar, on average. In addition, our similarity graphs show that some of these variant pairs have long segments of identical assembly opcodes at identical positions in the virus files. Since randomly chosen utility files are about 35% similar to one another, some of the virus creation kits are evidently not very effective.
To detect metamorphic virus variants, we experimented with the use of hidden Markov models (HMMs) to capture the statistical properties of viruses in the same family. We generated 200 NGVCK viruses, trained 25 models, and used the trained models to classify 65 programs, including both NGVCK viruses and other random non-viral programs. For seven of our models, we were able to perfectly distinguish the two types of files by their scores. The other cases produced varying numbers of false positives and false negatives, depending on the threshold used in the classification process. In most cases, our models achieved a detection rate of over 90% and a false positive rate of less than 10%. The number of states N of a model does not seem to have much impact on the performance of the HMM: we saw only small differences in the performance measures for models with N from 3 to 6.
If the variants of a metamorphic virus are sufficiently different that signature-based scanning cannot detect a newly morphed variant, the HMM approach provides a feasible solution. As with any statistical detection method, false predictions are possible. We showed the tradeoff between the detection rate and false positive rate.
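The threshold tradeoff described above can be illustrated with a minimal sketch. It assumes each program has already been given a single score by a trained HMM, with higher scores meaning a closer match to the virus family. The function names, the "score >= threshold" rule, and the score values below are illustrative assumptions, not data or code from this project.

```python
from typing import List, Tuple


def rates_at_threshold(
    virus_scores: List[float], benign_scores: List[float], threshold: float
) -> Tuple[float, float]:
    """Return (detection rate, false positive rate) for one threshold.

    Assumed rule: a file is flagged as a family virus when score >= threshold.
    """
    detected = sum(1 for s in virus_scores if s >= threshold)
    false_pos = sum(1 for s in benign_scores if s >= threshold)
    return detected / len(virus_scores), false_pos / len(benign_scores)


def sweep(
    virus_scores: List[float], benign_scores: List[float]
) -> List[Tuple[float, float, float]]:
    """Evaluate both rates at every observed score used as a threshold."""
    thresholds = sorted(set(virus_scores) | set(benign_scores))
    return [
        (t, *rates_at_threshold(virus_scores, benign_scores, t))
        for t in thresholds
    ]


# Made-up scores for illustration only (hypothetical, not measured results).
virus_scores = [-2.1, -2.4, -2.6, -3.0, -4.8]
benign_scores = [-4.5, -6.0, -7.2, -9.1]

for t, detection_rate, false_positive_rate in sweep(virus_scores, benign_scores):
    print(f"threshold={t:6.1f}  detection={detection_rate:.2f}  "
          f"false_positive={false_positive_rate:.2f}")
```

In this toy data, one virus variant scores below one benign file, so no threshold separates the two groups perfectly: lowering the threshold far enough to catch every variant also flags a benign file. When the two score ranges do not overlap, a single threshold classifies every file correctly, which corresponds to the perfectly separating models mentioned in the abstract.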