Property of Density in Entity Resolution and its Usage for Blocking and Learning

Dou, Chenxiao

doi:10.26190/unsworks/19824

Access & Terms of Use

open access
Copyright: Dou, Chenxiao

CC BY-NC-ND 3.0

Abstract

Entity Resolution in data engineering refers to searching for data records that originate from the same entities across different data sources. Solutions for Entity Resolution usually employ blocking and learning techniques to distinguish matching records from non-matching records. In this thesis, Density Monotonicity is first introduced to block data. By clustering candidate data using density information, most of the non-matches can be correctly detected and blocked. As a result, a more balanced dataset can be acquired. Compared to other blocking approaches that rely heavily on manually designed blocking criteria, the density-driven blocking approach can automatically find a suitable blocking criterion without the supervision of human experts. However, with the arrival of the big-data era, the efficiency of data-intensive algorithms is challenged by large-scale datasets. To overcome this challenge, parallel blocking is a common way to enhance blocking efficiency. Because the density property is still preserved in any randomly sampled dataset, the centralized blocking algorithm is upgraded to a distributed blocking algorithm. To improve efficiency, a probabilistic technique is adopted to balance the speed and the effectiveness of the distributed blocking algorithm. After the blocking process, active learning techniques are adopted to further retrieve matches from the remaining dataset. With the density property, a novel approach is provided to initialize the classifier. The density-based approach can initialize a high-quality classifier without the involvement of human experts. Through experiments on real-world datasets, the efficiency and effectiveness of the density-based approaches are validated. The density-based matching algorithms can achieve better blocking and learning performance than other state-of-the-art approaches.
Compared to other measures used to detect duplicates, density information can be attained more easily and cheaply. Throughout this thesis, the discovery of the data property and the proposed techniques have been examined through many experiments on real-world datasets and on a real cloud. The experiments related to big data were run on Hadoop MapReduce and Spark installed in the cloud. The experiments provide evidence of the effectiveness and efficiency of the proposed techniques.

Persistent link to this record

http://hdl.handle.net/1959.4/58435

DOI

https://doi.org/10.26190/unsworks/19824

Author(s)

Dou, Chenxiao

Supervisor(s)

Sun, Daniel

Wong, Raymond

Publication Year

2017

Resource Type

Thesis

Degree Type

PhD Doctorate

UNSW Faculty

Files

public version.pdf

2.69 MB

Adobe Portable Document Format
