Efficient exact similarity joins

Download files
Access & Terms of Use
open access
Copyright: Xiao, Chuan
Altmetric
Abstract
Similarity joins play an important role in many application areas, such as near duplicate Web page detection, data integration and cleaning, record linkage, and pattern recognition. Consequently, there has been much interest in developing efficient algorithms for this fundamental operation. In this thesis, we investigate four important problems of similarity join. 1. set similarity joins 2. similarity joins with edit constraints 3. top-k set similarity joins 4. approximate entity extraction with edit constraints We first study the problem of set similarity join. Two filtering techniques are developed by exploiting the ordering of tokens. They can be adapted or combined with existing approaches to produce better quality results or improve the runtime efficiency in detecting near duplicate Web pages. To address the problem of similarity joins with edit constraints, we propose a novel perspective of analyzing mismatching q-grams to speed up the similarity join. Two new filtering methods are developed to handle non-clustered edit errors and clustered edit errors. Then we study the top-k set similarity join problem. A novel algorithm is proposed to answer top-k similarity join queries. Opposed to traditional similarity joins algorithms, the proposed algorithm can progressively compute join results without providing a similarity threshold. Finally, we study the problem of approximate entity extraction with edit constraints. We tackle the major technical problem in existing neighborhood generation algorithms. Two novel pruning techniques are proposed to reduce the size of neighborhood, making our approach a highly competitive solution to approximate entity extraction.
Persistent link to this record
Link to Publisher Version
Link to Open Access Version
Additional Link
Author(s)
Xiao, Chuan
Supervisor(s)
Lin, Xuemin
Wang, Wei
Creator(s)
Editor(s)
Translator(s)
Curator(s)
Designer(s)
Arranger(s)
Composer(s)
Recordist(s)
Conference Proceedings Editor(s)
Other Contributor(s)
Corporate/Industry Contributor(s)
Publication Year
2010
Resource Type
Thesis
Degree Type
PhD Doctorate
UNSW Faculty
Files
download whole.pdf 1.24 MB Adobe Portable Document Format
Related dataset(s)