Publication Date

Spring 2016

Degree Type

Master's Project

Degree Name

Master of Science (MS)


Computer Science

First Advisor

Tran Duc Thanh

Second Advisor

Thomas Austin

Third Advisor

Subrahmanyam Bolla


Pairwise Similarity Apache Spark


Entity matching is the process of identifying different manifestations of the same real world entity. These entities can be referred to as objects(string) or data instances. These entities are in turn split over several databases or clusters based on the signatures of the entities. When entity matching algorithms are performed on these databases or clusters, there is a high possibility that a particular entity pair is compared more than once. The number of comparison for any two entities depend on the number of common signatures or keys they possess. This effects the performance of any entity matching algorithm. This paper is the implementation of the algorithm written by Erhard Rahm et al. for performing redundancy free pair-wise similarity computation using MapReduce. As an improvisation to the existing implementation, this project aims to implement the algorithm in Apache Spark in standalone mode for sample of data and in cluster mode for large volume of data.