Publication Date
Spring 2016
Degree Type
Master's Project
Degree Name
Master of Science (MS)
Department
Computer Science
First Advisor
Tran Duc Thanh
Second Advisor
Thomas Austin
Third Advisor
Subrahmanyam Bolla
Keywords
Pairwise Similarity, Apache Spark
Abstract
Entity matching is the process of identifying different manifestations of the same real-world entity. These entities can be referred to as objects (strings) or data instances. The entities are split over several databases or clusters based on their signatures. When entity matching algorithms are run on these databases or clusters, there is a high possibility that a particular entity pair is compared more than once. The number of comparisons for any two entities depends on the number of common signatures or keys they possess. This affects the performance of any entity matching algorithm. This paper presents an implementation of the algorithm written by Erhard Rahm et al. for performing redundancy-free pair-wise similarity computation using MapReduce. As an improvement on the existing implementation, this project aims to implement the algorithm in Apache Spark in standalone mode for a sample of data and in cluster mode for a large volume of data.
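The redundancy described in the abstract can be illustrated with a small, single-machine sketch in plain Python. It is not the project's Spark code: the entity ids, the signature keys, and the function names are all hypothetical. It also assumes one common way to avoid repeated comparisons, namely comparing a pair only in the block of its smallest shared signature; whether the project applies exactly this rule is an assumption.

```python
from collections import defaultdict
from itertools import combinations


def block_by_signature(entities):
    """Group entity ids by each signature key they carry."""
    blocks = defaultdict(list)
    for entity_id, signatures in entities.items():
        for sig in signatures:
            blocks[sig].append(entity_id)
    return blocks


def naive_pairs(entities):
    """List every within-block pair, once per shared signature."""
    pairs = []
    for sig, members in sorted(block_by_signature(entities).items()):
        for a, b in combinations(sorted(members), 2):
            pairs.append((a, b))
    return pairs


def redundancy_free_pairs(entities):
    """List each pair once: only in the block of its smallest shared signature.

    Assumption: the "smallest common key" rule stands in for the
    redundancy-free strategy; it is illustrative, not the project's code.
    """
    pairs = []
    for sig, members in sorted(block_by_signature(entities).items()):
        for a, b in combinations(sorted(members), 2):
            common = set(entities[a]) & set(entities[b])
            if sig == min(common):
                pairs.append((a, b))
    return pairs


# Hypothetical sample data: e1 and e2 share two signatures (k1 and k2).
entities = {
    "e1": {"k1", "k2"},
    "e2": {"k1", "k2"},
    "e3": {"k2", "k3"},
}

naive = naive_pairs(entities)
unique = redundancy_free_pairs(entities)
print(len(naive), len(unique))
```

In this sample the pair (e1, e2) appears in both the k1 block and the k2 block, so the naive version lists it twice, while the filtered version keeps it only in the k1 block.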
Recommended Citation
Tirumali, Parineetha Gandhi, "EFFICIENT PAIR-WISE SIMILARITY COMPUTATION USING APACHE SPARK" (2016). Master's Projects. 479.
DOI: https://doi.org/10.31979/etd.sh8a-3gyv
https://scholarworks.sjsu.edu/etd_projects/479