Property of Density in Entity Resolution and its Usage for Blocking and Learning

Access & Terms of Use
open access
Copyright: Dou, Chenxiao
Abstract
In data engineering, Entity Resolution refers to searching for data records that originate from the same entities across different data sources. Solutions for Entity Resolution usually employ blocking and learning techniques to distinguish matching records from non-matching records. In this thesis, Density Monotonicity is first introduced to block data. By clustering candidate data according to density information, most of the non-matches can be correctly detected and blocked. As a result, a more balanced dataset can be acquired. Compared to other blocking approaches that rely heavily on manually designed blocking criteria, the density-driven blocking approach can automatically find a suitable blocking criterion without the supervision of human experts. However, with the arrival of the big-data era, the efficiency of data-intensive algorithms is challenged by large-scale datasets. Parallel blocking is a common way to overcome this challenge and enhance blocking efficiency. Because the density property is preserved in any randomly sampled dataset, the centralized blocking algorithm is upgraded to a distributed blocking algorithm. To improve efficiency, a probabilistic technique is adopted to balance the speed and the effectiveness of the distributed blocking algorithm. After the blocking process, active learning techniques are adopted to further retrieve matches from the remaining dataset. Using the density property, a novel approach is provided to initialize the classifier. The density-based approach can initialize a high-quality classifier without the involvement of human experts. Experiments on real-world datasets validate the efficiency and effectiveness of the density-based approaches. The density-based matching algorithms can achieve better blocking and learning performance than other state-of-the-art approaches.
Compared to other measures used to detect duplicates, density information can be obtained more easily and cheaply. Throughout this thesis, the discovered data property and the proposed techniques have been examined through many experiments on real-world datasets and on a real cloud. The experiments related to big data were run on Hadoop MapReduce and Spark installed in the cloud. The experiments provide evidence of the effectiveness and efficiency of the proposed techniques.
Author(s)
Dou, Chenxiao
Supervisor(s)
Sun, Daniel
Wong, Raymond
Publication Year
2017
Resource Type
Thesis
Degree Type
PhD Doctorate