Publication Date

Spring 2016

Degree Type

Master's Project

Degree Name

Master of Science (MS)

Department

Computer Science

First Advisor

Tran Duc Thanh

Second Advisor

Thomas Austin

Third Advisor

Subrahmanyam Bolla

Keywords

Pairwise Similarity, Apache Spark

Abstract

Entity matching is the process of identifying different manifestations of the same real-world entity. These entities can be referred to as objects (strings) or data instances. They are in turn split across several databases or clusters based on their signatures. When entity matching algorithms are run on these databases or clusters, there is a high possibility that a particular entity pair is compared more than once. The number of comparisons for any two entities depends on the number of common signatures or keys they possess. This affects the performance of any entity matching algorithm. This paper presents an implementation of the algorithm by Erhard Rahm et al. for performing redundancy-free pair-wise similarity computation using MapReduce. As an improvement on the existing implementation, this project aims to implement the algorithm in Apache Spark, in standalone mode for a sample of data and in cluster mode for large volumes of data.
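The redundancy described in the abstract can be illustrated with a small sketch. It is not the project's code and does not use Spark. It assumes one common way of avoiding repeated comparisons: a pair is compared only in the block of its smallest common signature. The function names, the entity ids and the keys k1 and k2 are hypothetical, chosen only for illustration.

```python
from collections import defaultdict
from itertools import combinations


def block_by_signature(entities):
    """Group entity ids by every signature (blocking key) they carry."""
    blocks = defaultdict(list)
    for entity_id, signatures in entities.items():
        for sig in signatures:
            blocks[sig].append(entity_id)
    return blocks


def naive_pairs(entities):
    """Compare all pairs inside every block; shared signatures cause repeats."""
    pairs = []
    for sig, members in sorted(block_by_signature(entities).items()):
        for a, b in combinations(sorted(members), 2):
            pairs.append((a, b))
    return pairs


def redundancy_free_pairs(entities):
    """Emit each pair once: only in the block of its smallest common signature.

    Assumption: this 'smallest common key' rule stands in for the
    redundancy-free strategy; the project's actual rule may differ.
    """
    pairs = []
    for sig, members in sorted(block_by_signature(entities).items()):
        for a, b in combinations(sorted(members), 2):
            common = entities[a] & entities[b]
            if sig == min(common):
                pairs.append((a, b))
    return pairs


# Hypothetical data: e1 and e2 share two signatures, so the naive
# approach compares them twice (once per shared signature).
entities = {
    "e1": {"k1", "k2"},
    "e2": {"k1", "k2"},
    "e3": {"k2"},
}

print(len(naive_pairs(entities)))            # comparisons with redundancy
print(len(redundancy_free_pairs(entities)))  # comparisons without redundancy
```

In a MapReduce or Spark setting the same check would typically run in the reduce step for each key, with every entity carrying its full signature set so the check needs no extra lookup.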

Recommended Citation

Tirumali, Parineetha Gandhi, "EFFICIENT PAIR-WISE SIMILARITY COMPUTATION USING APACHE SPARK" (2016). Master's Projects. 479.
DOI: https://doi.org/10.31979/etd.sh8a-3gyv
https://scholarworks.sjsu.edu/etd_projects/479

Included in

Databases and Information Systems Commons
