Abstract
Similarity joins play an important role in many application areas, such as near duplicate Web page detection, data integration and cleaning, record linkage, and pattern recognition. Consequently, there has been much interest in developing efficient algorithms for this fundamental operation.
In this thesis, we investigate four important problems of similarity join.
1. set similarity joins
2. similarity joins with edit constraints
3. top-k set similarity joins
4. approximate entity extraction with edit constraints
We first study the problem of set similarity join. Two filtering techniques are developed by exploiting the ordering of tokens. They can be adapted or combined with existing approaches to produce better quality results or improve the runtime efficiency in detecting near duplicate Web pages.
To address the problem of similarity joins with edit constraints, we propose a novel perspective of analyzing mismatching q-grams to speed up the similarity join. Two new filtering methods are developed to handle non-clustered edit errors and clustered edit errors.
Then we study the top-k set similarity join problem. A novel algorithm is proposed to answer top-k similarity join queries. Opposed to traditional similarity joins algorithms, the proposed algorithm can progressively compute join results without providing a similarity threshold.
Finally, we study the problem of approximate entity extraction with edit constraints. We tackle the major technical problem in existing neighborhood
generation algorithms. Two novel pruning techniques are proposed to reduce the size of neighborhood, making our approach a highly competitive solution to approximate entity extraction.