Approximate Query Processing with Multiple Similarity Metrics

Wang, Yaoshu

doi:10.26190/unsworks/20895

Approximate Query Processing with Multiple Similarity Metrics

Download files

Access & Terms of Use

open access
Copyright: Wang, Yaoshu

CC BY-NC-ND 3.0

Abstract

Approximate query processing based on multiple similarity metrics is prevalent and essential for many applications in the database area, such as information retrieval, data mining, pattern matching, and so forth. Almost all existing works adopt the filter-and-verify paradigm. A good filtering method directly affects the final performance and is related to two aspects: small filtering cost and strong pruning power. In this thesis, we mainly designed various powerful filtering algorithms that solve three similarity metrics. The first work is approximate query search with edit distance, which finds all data strings whose edit distances with a given query are no larger than a given threshold. All existing works solve the problem adopting the strategy of finding exact matching substrings, but they all suffer from the situation that common substrings could be very short, low selective, and inefficiency of large threshold. We proposed a new lower bound called local filtering that focuses on finding dissimilar parts between two strings. We designed a series of filtering strategies and novel index structures to quickly retrieve candidates. Finally we show the superiority of our methods over existing works in the experiments. The second work in the thesis is approximate query search in Hamming space, which finds all binary codes whose Hamming distance with the query is no larger than the threshold. The main drawbacks of existing methods are that their bounds are not tight enough, and they cannot solve data skewness. To solve them, we proposed the concept called General Pigeonhole Principle, which allowed allocating different errors to partitions. To reduce the candidate size, we designed a Dynamic Programming based method that allocates thresholds according to the skewness of partitions and a dimension re-arrangement strategy. Our experiments showed that our method outperformed others. The third work in the thesis is Jaro-Winkler distance, which is often used in entity matching and machine learning models. Existing work utilized the Trie index structure, but it was time-consuming. To solve it, we proposed a partition-based approach, and designed an e-variant method to accelerate query processing procedure. Our method is approximately 3 times faster than the existing work.

Persistent link to this record

http://hdl.handle.net/1959.4/61182

DOI

https://doi.org/10.26190/unsworks/20895

Author(s)

Wang, Yaoshu

Supervisor(s)

Wang, Wei

Publication Year

2018

Resource Type

Thesis

Degree Type

PhD Doctorate

UNSW Faculty

Files

public version.pdf

2.6 MB

Adobe Portable Document Format

View full record Show statistics

Library

Approximate Query Processing with Multiple Similarity Metrics

Access & Terms of Use

Altmetric

Abstract

Persistent link to this record

DOI

Link to Publisher Version

Link to Open Access Version

Additional Link

Author(s)

Supervisor(s)

Creator(s)

Editor(s)

Translator(s)

Curator(s)

Designer(s)

Arranger(s)

Composer(s)

Recordist(s)

Conference Proceedings Editor(s)

Other Contributor(s)

Corporate/Industry Contributor(s)

Publication Year

Resource Type

Degree Type

UNSW Faculty

Files

Related dataset(s)