Approximate Query Processing with Multiple Similarity Metrics

Download files
Access & Terms of Use
open access
Copyright: Wang, Yaoshu
Altmetric
Abstract
Approximate query processing based on multiple similarity metrics is prevalent and essential for many applications in the database area, such as information retrieval, data mining, pattern matching, and so forth. Almost all existing works adopt the filter-and-verify paradigm. A good filtering method directly affects the final performance and is related to two aspects: small filtering cost and strong pruning power. In this thesis, we mainly designed various powerful filtering algorithms that solve three similarity metrics. The first work is approximate query search with edit distance, which finds all data strings whose edit distances with a given query are no larger than a given threshold. All existing works solve the problem adopting the strategy of finding exact matching substrings, but they all suffer from the situation that common substrings could be very short, low selective, and inefficiency of large threshold. We proposed a new lower bound called local filtering that focuses on finding dissimilar parts between two strings. We designed a series of filtering strategies and novel index structures to quickly retrieve candidates. Finally we show the superiority of our methods over existing works in the experiments. The second work in the thesis is approximate query search in Hamming space, which finds all binary codes whose Hamming distance with the query is no larger than the threshold. The main drawbacks of existing methods are that their bounds are not tight enough, and they cannot solve data skewness. To solve them, we proposed the concept called General Pigeonhole Principle, which allowed allocating different errors to partitions. To reduce the candidate size, we designed a Dynamic Programming based method that allocates thresholds according to the skewness of partitions and a dimension re-arrangement strategy. Our experiments showed that our method outperformed others. The third work in the thesis is Jaro-Winkler distance, which is often used in entity matching and machine learning models. Existing work utilized the Trie index structure, but it was time-consuming. To solve it, we proposed a partition-based approach, and designed an e-variant method to accelerate query processing procedure. Our method is approximately 3 times faster than the existing work.
Persistent link to this record
Link to Publisher Version
Link to Open Access Version
Additional Link
Author(s)
Wang, Yaoshu
Supervisor(s)
Wang, Wei
Creator(s)
Editor(s)
Translator(s)
Curator(s)
Designer(s)
Arranger(s)
Composer(s)
Recordist(s)
Conference Proceedings Editor(s)
Other Contributor(s)
Corporate/Industry Contributor(s)
Publication Year
2018
Resource Type
Thesis
Degree Type
PhD Doctorate
UNSW Faculty
Files
download public version.pdf 2.6 MB Adobe Portable Document Format
Related dataset(s)