Publication Date

Spring 2014

Degree Type

Master's Project

Degree Name

Master of Science (MS)


Computer Science


Naive Bayes and Tree Augmented Naive Bayes (TAN) are probabilistic graphical models used
to model large datasets whose interdependent features involve considerable uncertainty.
Common applications of these models include image segmentation, medical diagnosis, and
other data clustering and classification tasks. A
classification problem deals with identifying the category to which a particular instance belongs,
based on knowledge acquired from analyzing previous instances. The instances are
described using a set of variables called attributes or features. A Naive Bayes model assumes that
all the attributes of an instance are independent of each other given the class of that instance.
This is a very simple representation of the system, but the independence assumptions made in
this model are often violated in practice and are therefore unrealistic. The TAN model improves on the Naive Bayes model by
adding one more level of interaction among the attributes of the system. In the TAN model, every
attribute depends on the class and on at most one other attribute from the feature set. Since this model
incorporates the dependencies among the attributes, it is more realistic than a Naive Bayes
model. This project analyzes the performance of these two models on various datasets. The TAN
model performs better when the attributes are correlated, but its performance is almost the
same as that of the Naive Bayes model when the correlations between the attributes of the
system are weak.
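
The difference between the two factorizations can be illustrated with a minimal sketch. It is not the project's implementation: the function names `train` and `predict`, the `parents` list, the Laplace smoothing constant `alpha`, and the XOR-style toy data are all assumptions made for illustration. Setting every entry of `parents` to `None` gives Naive Bayes; giving an attribute one attribute parent gives a TAN-style model. The tree structure is supplied by hand here rather than learned from data.

```python
import math
from collections import Counter

def train(rows, labels, parents):
    """Collect counts for P(class) and P(x_i | class, x_parent(i)).

    parents[i] is the index of attribute i's attribute parent, or None.
    All None corresponds to Naive Bayes (hypothetical helper, for illustration).
    """
    class_counts = Counter(labels)
    joint = Counter()    # (i, class, parent_value, value) -> count
    context = Counter()  # (i, class, parent_value) -> count
    values = [set() for _ in parents]
    for row, c in zip(rows, labels):
        for i, p in enumerate(parents):
            pv = row[p] if p is not None else None
            joint[(i, c, pv, row[i])] += 1
            context[(i, c, pv)] += 1
            values[i].add(row[i])
    return class_counts, joint, context, values, parents

def predict(model, row, alpha=1.0):
    """Return the class with the highest log posterior (Laplace-smoothed)."""
    class_counts, joint, context, values, parents = model
    total = sum(class_counts.values())
    best, best_score = None, -math.inf
    for c, n_c in class_counts.items():
        score = math.log(n_c / total)
        for i, p in enumerate(parents):
            pv = row[p] if p is not None else None
            num = joint[(i, c, pv, row[i])] + alpha
            den = context[(i, c, pv)] + alpha * len(values[i])
            score += math.log(num / den)
        if score > best_score:
            best, best_score = c, score
    return best

# Toy data (assumed): the class is the XOR of two binary attributes,
# so the attributes are only informative when considered together.
patterns = [(0, 0), (0, 1), (1, 0), (1, 1)]
rows = patterns * 5
labels = [a ^ b for a, b in rows]

nb = train(rows, labels, parents=[None, None])  # Naive Bayes
tan = train(rows, labels, parents=[None, 0])    # attribute 1 depends on attribute 0

nb_hits = sum(predict(nb, r) == (r[0] ^ r[1]) for r in patterns)
tan_hits = sum(predict(tan, r) == (r[0] ^ r[1]) for r in patterns)
print("Naive Bayes correct:", nb_hits, "of 4")
print("TAN-style correct:", tan_hits, "of 4")
```

On this toy data each attribute on its own carries no information about the class, so the Naive Bayes scores for the two classes are equal and it cannot classify all four patterns, while the model with one attribute-to-attribute dependency can. On data without such correlations the extra edge adds little, which matches the behavior described above.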