Publication Date

Fall 2015

Degree Type

Master's Project


Computer Science


Entity Matching (EM) is a complex problem with a great impact on data quality. In EM, we usually compare all combinations of entity pairs using different similarity measures and judge whether each pair is a match. A MapReduce-based parallel programming model can be used to match these entities. Distributing the data evenly across the map and reduce tasks plays a vital role in the performance of a MapReduce-based program. If the dataset is large and skewed, the distribution must be done carefully to achieve load balancing. In this paper, I have implemented a blocking technique called “Block Split”. Block Split reduces the search space of the match tasks by splitting larger blocks into multiple smaller blocks and processing them with the MapReduce model. This approach uses two MapReduce jobs: the first identifies the data distribution across blocks, and the second uses that distribution to perform the match tasks. The effectiveness of the Block Split approach is described in terms of ‘recall’ and ‘precision’. To improve recall, I iteratively applied blocking with different keys, assigning every input record to several blocks (one per blocking key) and then finding matches within each block. This will most likely find more matches, but it may also produce many redundant matches. I have optimized this approach by using “Signature Based Pair Comparison”. I evaluated all of these approaches on Spark clusters.
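The two-job idea behind Block Split can be illustrated with a minimal single-process sketch in plain Python. It is not the project's Spark implementation: the function names `block_sizes` and `match_tasks`, the `max_block_size` threshold, and the choice to cut large blocks into fixed-size chunks are all assumptions made for illustration. The first function stands in for the job that gathers the block distribution; the second builds match tasks, splitting any block larger than the threshold into sub-blocks and adding one cross task per pair of sub-blocks so that no candidate pair is lost.

```python
from collections import defaultdict
from itertools import combinations, product


def block_sizes(records, key_fn):
    """Job 1 (sketch): count how many records fall into each block."""
    sizes = defaultdict(int)
    for rec in records:
        sizes[key_fn(rec)] += 1
    return dict(sizes)


def match_tasks(records, key_fn, max_block_size):
    """Job 2 (sketch): build match tasks, splitting oversized blocks.

    A block with at most max_block_size records becomes a single task.
    A larger block is cut into sub-blocks (hypothetical fixed-size chunks):
    each sub-block is one task, and each pair of sub-blocks is one
    additional cross task. Each task is (key, i, j, candidate_pairs).
    """
    blocks = defaultdict(list)
    for rec in records:
        blocks[key_fn(rec)].append(rec)

    tasks = []
    for key in sorted(blocks):
        members = blocks[key]
        subs = [members[i:i + max_block_size]
                for i in range(0, len(members), max_block_size)]
        # Pairs inside one sub-block.
        for i, sub in enumerate(subs):
            tasks.append((key, i, i, list(combinations(sub, 2))))
        # Pairs spanning two different sub-blocks of the same block.
        for i, j in combinations(range(len(subs)), 2):
            tasks.append((key, i, j, list(product(subs[i], subs[j]))))
    return tasks


# Toy data: the blocking key is the first character (an assumption).
records = ["aa", "ab", "ac", "ad", "ae", "ba", "bb"]
first_char = lambda rec: rec[0]

sizes = block_sizes(records, first_char)
tasks = match_tasks(records, first_char, max_block_size=2)
largest_task = max(len(pairs) for _, _, _, pairs in tasks)
total_pairs = sum(len(pairs) for _, _, _, pairs in tasks)
```

In this toy run the skewed block "a" holds five records, so comparing it as one task would mean ten comparisons on a single reducer. After splitting, the same ten comparisons are spread over six smaller tasks, which a scheduler could then distribute across reduce tasks to balance the load.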