Publication Date

Fall 2015

Degree Type

Master's Project

Degree Name

Master of Science (MS)


Computer Science


Entity matching, also known as entity resolution, duplicate identification, reference reconciliation, or record linkage, is a critically important task for data cleaning and data integration. It can be thought of as the task of finding entities that refer to the same entity in the real world. These entities can belong to a single data source or to several distributed data sources. Entity matching takes structured data as input, and the process compares that structured data (an entity or database record) with the entities present in a knowledge base. For large-scale entity matching, data has to go through a sequence of steps that includes evaluation, preprocessing, candidate calculation, and classification. The entity matching workflow consists of two strategies: blocking (map) and matching (reduce). Blocking is the division of a data source into partitions, or blocks. It improves performance by placing similar entities in the same block and then comparing entities only within blocks. Partitioning relies on blocking keys, which are derived from the values of one or several of an entity's attributes; in most cases, the blocking key is a concatenation of prefixes of these attributes. The second part of the workflow is the matching strategy, which aims to identify all matching entity pairs within the same partition. Deciding whether a pair matches requires the result of comparing the two entities. A matching strategy can use several matching approaches and combine their similarity scores to decide whether an entity pair is a match. The entity matching model expects the matching strategy to return the list of matching entity pairs. Thus, by relating structured data to its most apposite entity, entity matching tries to get the most out of the existing knowledge base.

One of the best solutions for entity matching is Dedoop [4] (Deduplication with Hadoop). Evaluating the Cartesian product of all entity pairs has a time complexity of O(n^2), which causes a heavy workload; to leave more time for the matching techniques that maintain quality, load balancing techniques are necessary. Even after blocking is applied, entity matching can still be costly and can take up to several days to complete on large datasets. The MapReduce [2] programming model is well suited to executing entity matching (EM) in parallel. During execution, the input file is split into multiple parts, or chunks. In the map phase, multiple map tasks read those parts, which contain the entities, in parallel. In the reduce phase, the entities are redistributed among several reduce tasks based on their blocking keys. This groups entities with the same blocking key together so that matching can be applied in parallel.
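The blocking (map) and matching (reduce) workflow described above can be illustrated with a small in-memory sketch. This is not Dedoop's implementation and does not run on Hadoop. It only imitates the map, shuffle, and reduce steps in plain Python. The attribute names (`name`, `city`), the prefix length of 3, the averaged `difflib` similarity, and the 0.8 threshold are all assumptions chosen for illustration.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations

# Illustrative assumptions, not values from the project.
ATTRIBUTES = ("name", "city")
PREFIX_LEN = 3
THRESHOLD = 0.8


def blocking_key(entity, attributes=ATTRIBUTES, prefix_len=PREFIX_LEN):
    """Concatenate lower-cased prefixes of the chosen attributes."""
    return "".join(str(entity[a]).lower()[:prefix_len] for a in attributes)


def map_phase(entities):
    """Map step: emit one (blocking key, entity) pair per input entity."""
    for entity in entities:
        yield blocking_key(entity), entity


def shuffle(keyed_entities):
    """Group entities by blocking key, as the framework would before reduce."""
    blocks = defaultdict(list)
    for key, entity in keyed_entities:
        blocks[key].append(entity)
    return blocks


def similarity(a, b, attributes=ATTRIBUTES):
    """Combine per-attribute string similarity scores by averaging them."""
    scores = [
        SequenceMatcher(None, str(a[x]).lower(), str(b[x]).lower()).ratio()
        for x in attributes
    ]
    return sum(scores) / len(scores)


def reduce_phase(blocks, threshold=THRESHOLD):
    """Reduce step: compare entity pairs only within each block."""
    matches = []
    for block in blocks.values():
        for a, b in combinations(block, 2):
            if similarity(a, b) >= threshold:
                matches.append((a["id"], b["id"]))
    return matches


def match_entities(entities):
    """Run the whole sketch: map, shuffle, then reduce."""
    return reduce_phase(shuffle(map_phase(entities)))


# Hypothetical sample records.
records = [
    {"id": 1, "name": "Jonathan Smith", "city": "San Jose"},
    {"id": 2, "name": "Jonathon Smith", "city": "San Jose"},
    {"id": 3, "name": "Maria Garcia", "city": "Fresno"},
    {"id": 4, "name": "Jon Smith", "city": "Santa Clara"},
]

if __name__ == "__main__":
    print(match_entities(records))
```

In this sample, records 1, 2, and 4 share the blocking key `jonsan`, so only those three are compared with one another, while record 3 sits alone in its block and is never compared. A block with k entities still needs k(k-1)/2 comparisons, which is why a skewed block distribution calls for the load balancing mentioned above.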