Approximate similarity search in high dimensional spaces: solutions, evaluations and applications

Download files
Access & Terms of Use
open access
Copyright: Sun, Yifang
Altmetric
Abstract
In this thesis, we study high dimensional approximate similarity search algorithms. High dimensional similarity searches find objects in a reference database that are similar to a query object. It is a fundamental and essential task in applications from a diverse range of domains, including database, machine learning, multimedia, recommendation systems, and computer vision. The high dimensional similarity search problem suffers from “the curse of dimensionality”, as finding the exact answers in high dimensional spaces is costly, and brute-force linear scan is empirically the best exact similarity search method with linear space and time cost. However, the curse of dimensionality can be alleviated by sacrificing the quality of the returned answers (e.g., approximation). In this thesis, we tackle this problem through several different directions. We firstly solve the c-approximate nearest neighbor search problem in high dimensional Euclidean space with random projections. Currently, the most popular approach for this problem is Locality sensitive hashing (LSH). Unlike the existing LSH based methods, which usually require large index space, we propose several surprisingly simple methods (i.e., SRS) to answer c-ANN queries with theoretical guarantees requiring only a single tiny index. Our methods are highly flexible and support a variety of functionalities, such as finding the exact nearest neighbor with any given probability. We demonstrate experimentally that our methods are superior against the state-of-the-art LSH-based methods, and scale up well to 1 billion high-dimensional points on a single commodity PC. Moreover, motivated by the empirical observation of SRS, we analyze the SRS algorithm rigorously and generalize it to a simpler and more flexible frame- work. Our framework works for several types of distance-based queries for high- dimensional data based on p-stable random projections. By using the framework, we solve the approximate furthest neighbor problem, and extend the solution of c-ANN problem from L2 norm to arbitrary Lp norm with 0 < p <= 2. Experimental results demonstrate the effciency and effectiveness of our proposed solutions. Due to the applications from a diverse range of domains, nearest neighbor searches have attracted attentions from different research areas. Although dozens of algorithms have been continuously proposed in the literature in several research areas, there is no comprehensive evaluation and analysis of their performances. In this thesis, we conduct a comprehensive experimental evaluation of many state-of-the- art methods for nearest neighbor searches. The experimental results are carefully analyzed in different perspectives. This also gives us invaluable insights and many ideas. We showcase the benefit by describing a new method that achieves both high query efficiency and ultra-high recall empirically on majority of the datasets and under a wide range of settings. In this thesis, we also show a real application of high dimensional similarity search. we present a near duplicate text detection framework, where the key part is to find high quality candidates with signature based similarity search method. We propose a novel signature selection algorithm which uses collection frequency of q-grams. We compare our algorithm with Winnowing, which is one of the state-of-the-art signature selection algorithms, and show that our algorithm acquires much better accuracy with less time and space cost with extensive experiments.
Persistent link to this record
Link to Publisher Version
Link to Open Access Version
Additional Link
Author(s)
Sun, Yifang
Supervisor(s)
Wang, Wei
Creator(s)
Editor(s)
Translator(s)
Curator(s)
Designer(s)
Arranger(s)
Composer(s)
Recordist(s)
Conference Proceedings Editor(s)
Other Contributor(s)
Corporate/Industry Contributor(s)
Publication Year
2016
Resource Type
Thesis
Degree Type
PhD Doctorate
UNSW Faculty
Files
download public version.pdf 3.14 MB Adobe Portable Document Format
Related dataset(s)