Publication Date

Fall 2015

Degree Type

Master's Project

Degree Name

Master of Science (MS)

Department

Computer Science

Abstract

Entity Matching (EM) is a complex problem that has a great impact on data quality. In EM, we usually compare all combinations of entity pairs using different similarity measures and judge whether each pair is a match. The MapReduce parallel programming model can be used to perform these comparisons. Even distribution of data across the map and reduce tasks plays a vital role in the performance of a MapReduce-based program. If the dataset is large and skewed, the data must be distributed carefully to achieve load balancing. In this paper, I implement a blocking technique called “Block Split”. Block Split reduces the search space of each match task by splitting larger blocks into multiple smaller blocks and processing them with the MapReduce model. The approach uses two MapReduce jobs: the first identifies the data distribution in each block, and the second uses this distribution to perform the match tasks. The effectiveness of the Block Split approach is described in terms of ‘recall’ and ‘precision’. To improve recall, I iteratively applied blocking with different keys, assigning every input record to several blocks (one per blocking key) and then finding matches within each block. This will most likely find more matches, but it may also produce many redundant matches. I optimized this approach using “Signature Based Pair Comparison”. I evaluated all approaches on Spark clusters.
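The two-job idea described in the abstract can be illustrated with a minimal single-machine sketch in plain Python. It is not the project's Spark implementation. The function names, the `max_size` threshold, and the rule of splitting a block into fixed-size sub-blocks are all assumptions made for illustration. The first function stands in for the job that measures the block distribution; the second stands in for the job that builds match tasks, where each oversized block becomes several smaller tasks that still cover every pair in the original block.

```python
from collections import defaultdict
from itertools import combinations, product


def block_distribution(records, blocking_key):
    """Stand-in for job 1: count how many records fall into each block."""
    counts = defaultdict(int)
    for record in records:
        counts[blocking_key(record)] += 1
    return dict(counts)


def split_block(members, max_size):
    """Cut one block into consecutive sub-blocks of at most max_size records.

    Fixed-size splitting is an assumption of this sketch, not necessarily
    the rule used in the project.
    """
    return [members[i:i + max_size] for i in range(0, len(members), max_size)]


def match_tasks(records, blocking_key, max_size):
    """Stand-in for job 2: build match tasks keyed by (block, i, j).

    (block, i, i) compares records inside sub-block i;
    (block, i, j) with i < j compares sub-block i against sub-block j.
    Together the tasks cover every pair of the original block exactly once.
    """
    blocks = defaultdict(list)
    for record in records:
        blocks[blocking_key(record)].append(record)

    tasks = {}
    for key, members in blocks.items():
        parts = split_block(members, max_size)
        for i, part in enumerate(parts):
            tasks[(key, i, i)] = list(combinations(part, 2))
        for i, j in combinations(range(len(parts)), 2):
            tasks[(key, i, j)] = list(product(parts[i], parts[j]))
    return tasks


# Hypothetical skewed input: the block "a" is much larger than the others.
names = ["apple", "avocado", "apricot", "almond", "acai",
         "banana", "blueberry", "cherry"]
first_letter = lambda name: name[0]

print(block_distribution(names, first_letter))   # {'a': 5, 'b': 2, 'c': 1}
tasks = match_tasks(names, first_letter, max_size=2)
print(max(len(pairs) for pairs in tasks.values()))  # 4
```

Without splitting, the block "a" would form a single task of 10 comparisons. With `max_size=2` the largest task holds 4 comparisons, so the work can be spread more evenly across reduce tasks while the total number of candidate pairs is unchanged.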

Recommended Citation

Kondra, Akhilesh, "LOAD BALANCING FOR BIG DATA ENTITY MATCHING USING BLOCK SPLIT" (2015). Master's Projects. 457.
DOI: https://doi.org/10.31979/etd.jez2-jkmx
https://scholarworks.sjsu.edu/etd_projects/457

Included in

Computer Sciences Commons
