Efficient String Similarity Search with an Edit Distance Threshold

Download files
Access & Terms of Use
open access
Copyright: Qin, Jianbin
Altmetric
Abstract
In this thesis, we study efficient exact query processing algorithms for edit similarity queries and their variants. An edit similarity query finds database strings within edit distance threshold t from the query string. It plays an important role in many application areas, such as deduplication, data integration and cleansing, query suggestion, bioinformatics, and pattern recognition. Consequently, there has been much interest in developing efficient algorithms for this problem. We study three specific problems in this thesis. The first problem is efficient algorithms for similarity joins with an edit distance constraint. Currently, the most prevalent approach is based on extracting overlapping grams from strings and considering only strings that share a certain number of grams as candidates. Unlike these existing approaches, we propose a novel approach to edit similarity join based on extracting non-overlapping substrings, or chunks, from strings. We propose a class of chunking schemes based on the notion of tail-restricted chunk boundary dictionary. A new algorithm, VChunkJoin, is designed by integrating existing filtering methods and several new filters unique to our chunk-based method. We also design a greedy algorithm to automatically select a good chunking scheme for a given dataset. We demonstrate experimentally that the new algorithm is faster than alternative methods yet occupies less space. The second problem we study is efficient algorithms for similarity search with an edit distance constraint. We define a general signature-based framework and show that the minimum signature size is t + 1. In order to achieve this bound, we propose asymmetric signature schemes, i.e., extracting different number of signatures for the data and query strings. We first propose a basic asymmetric scheme based on matching q-chunks and q-grams between two strings. Two efficient query processing algorithms (IndexGram and IndexChunk) are developed on top of this scheme. We also propose novel candidate pruning methods to further improve the efficiency. We then generalize the basic scheme by incorporating novel ideas of floating qchunks, optimal selection of q-chunks, and reducing the number of signatures using global ordering. As a result, we develop the Super and Turbo families of schemes together with their corresponding query processing algorithms. We have conducted a comprehensive experimental study using the six asymmetric algorithms and nine previous state-of-the-art algorithms. The experiment results clearly showcase the efficiency of our methods and demonstrate space and time characteristics of our proposed algorithms. The last problem we study is the problem of query autocompletion that tolerates errors in user s inputs modeled by an edit distance constraint. Previous approaches index data strings in a trie, and continuously maintain all the prefixes of data strings whose edit distance with the query are within the threshold. The major inherent problem is that the number of such prefixes is huge for the first few characters of the query and is exponential in the alphabet size. This results in slow query time even if the entire query approximately matches only few prefixes. We propose a novel neighborhood generation-based algorithm, IncNGTrie, which can achieve up to three orders of magnitude speedup over existing methods for the error-tolerant query autocompletion problem. Our proposed algorithm only maintains a small set of active nodes, thus saving both space and time to process the query. We also study efficient duplicate removal methods which is a core problem in fetching query answers. In addition, we propose optimization techniques to reduce our index size, as well as discussions on several extensions to our method. The efficiency of our method is demonstrated against existing methods through extensive experiments on real datasets.
Persistent link to this record
Link to Publisher Version
Link to Open Access Version
Additional Link
Author(s)
Qin, Jianbin
Supervisor(s)
Wang, Wei
Creator(s)
Editor(s)
Translator(s)
Curator(s)
Designer(s)
Arranger(s)
Composer(s)
Recordist(s)
Conference Proceedings Editor(s)
Other Contributor(s)
Corporate/Industry Contributor(s)
Publication Year
2013
Resource Type
Thesis
Degree Type
PhD Doctorate
UNSW Faculty
Files
download whole.pdf 1.86 MB Adobe Portable Document Format
Related dataset(s)