Process big data using approximation methods

Access & Terms of Use
open access
Copyright: Chen, Chen
Abstract
Data proliferation makes big data analysis a challenging task. One way to address the issue is to use parallel systems, but this is costly. Meanwhile, it has been shown that approximation methods with probabilistic guarantees are sufficient in many real applications. In this thesis, we apply sampling and sketching techniques to obtain high-quality approximate results for four important queries.

Firstly, we study the gapped set intersection size estimation problem. Given two integer sets and a gap parameter δ, two elements are deemed a match if their numeric difference equals δ or is at most δ. We first distinguish two subtypes of the problem: point gap estimation (the difference equals δ) and range gap estimation (the difference is at most δ). We then propose optimized sketches that tackle the two problems efficiently and effectively with theoretical guarantees, and we demonstrate the use of the proposed techniques in efficiently mining top-K related keywords.

Secondly, in response to the growing demand for fast data visualization, we study the order-preserving estimation problem, which retains important characteristics of the data. We focus on the population mean as the primary estimation function. By dynamically allocating the failure probability, we propose two effective query processing strategies that keep the order of the estimates correct with probabilistic guarantees.

Finally, motivated by the demand for analyzing large graphs, we investigate two related problems in social networks: influence maximization and closeness centrality. To accelerate influence maximization, we bring the order of samples into the RIS (reverse influence sampling) framework and derive early-stop conditions that shorten the seed set selection procedure. Furthermore, we provide a cost-effective method for choosing a sample size that bounds the quality of the returned seed set. For closeness centrality, we aim to find a set of k nodes that has the largest closeness centrality as a whole. We show that this problem is NP-hard and prove that its objective function is monotone and submodular. To handle large graphs, we propose a sampling-based approach that reduces the cost of computing shortest-path distances from the sampled nodes to the other nodes by selecting the sampled nodes incrementally. Additional optimizations further reduce this cost.
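The gapped intersection sketches build on standard set-intersection summaries. As a rough illustration of the underlying idea (not the thesis's optimized sketches), the following Python sketch estimates an intersection size with a k-minimum-values (KMV) summary and shows how a point-gap query reduces to a plain intersection by shifting one set; the hash function, the sketch size k, and all helper names are illustrative assumptions.

```python
import hashlib

def kmv_sketch(items, k=64):
    """KMV sketch: the k smallest normalized hash values of a set."""
    hashes = sorted(
        int(hashlib.md5(str(x).encode()).hexdigest(), 16) / 2.0**128
        for x in set(items)
    )
    return hashes[:k]

def estimate_intersection(sketch_a, sketch_b, k=64):
    """Classic KMV estimate of |A ∩ B| from two sketches."""
    merged = sorted(set(sketch_a) | set(sketch_b))[:k]
    in_both = set(sketch_a) & set(sketch_b) & set(merged)
    if len(merged) < k:                  # whole union fits in the sketch,
        return float(len(in_both))       # so the count is exact
    tau = merged[-1]                     # k-th smallest hash of the union
    union_est = (k - 1) / tau            # standard KMV union-size estimate
    return len(in_both) / k * union_est  # scale by the estimated Jaccard

def estimate_point_gap(a, b, delta, k=64):
    """Point-gap matches (y - x == delta) reduce to a plain intersection
    after shifting every element of one set by delta."""
    shifted = [y - delta for y in b]
    return estimate_intersection(kmv_sketch(a, k), kmv_sketch(shifted, k), k)
```

A range-gap query (difference at most δ) can in principle be reduced to a combination of such shifted intersections; handling that combination without one sketch per shift is where optimized sketches are needed.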
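The order-preserving estimation strategy can be pictured as incremental sampling with confidence intervals: keep drawing samples until the intervals around the two means separate, at which point their order is settled with high probability. The sketch below uses Hoeffding bounds and a fixed, even split of the failure probability across the two intervals (a simple union bound); the thesis's dynamic allocation of the failure probability refines this baseline, and all names and parameters here are assumptions.

```python
import math, random

def hoeffding_half_width(n, delta, value_range=1.0):
    """Hoeffding half-width: the sample mean of n bounded draws is within
    this distance of the true mean with probability >= 1 - delta."""
    return value_range * math.sqrt(math.log(2.0 / delta) / (2.0 * n))

def order_two_means(pop_a, pop_b, delta=0.05, value_range=1.0,
                    batch=200, max_n=100_000):
    """Sample both populations until the confidence intervals separate;
    the reported order of the means then holds w.p. >= 1 - delta."""
    sample_a, sample_b = [], []
    while len(sample_a) < max_n:
        sample_a += random.choices(pop_a, k=batch)   # with replacement
        sample_b += random.choices(pop_b, k=batch)
        mean_a = sum(sample_a) / len(sample_a)
        mean_b = sum(sample_b) / len(sample_b)
        eps_a = hoeffding_half_width(len(sample_a), delta / 2, value_range)
        eps_b = hoeffding_half_width(len(sample_b), delta / 2, value_range)
        if mean_a + eps_a < mean_b - eps_b:
            return "mean(A) < mean(B)"
        if mean_b + eps_b < mean_a - eps_a:
            return "mean(B) < mean(A)"
    return "undecided"   # means too close to separate within the budget
```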
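For influence maximization, the RIS framework estimates a node set's influence by how many random reverse-reachable (RR) sets it covers. A minimal sketch under the independent-cascade model, assuming a fixed edge probability p and a fixed number of samples (whereas the thesis derives early-stop conditions and a cost-effective sample size instead), might look like this; adj_rev, p, and num_samples are illustrative assumptions.

```python
import random

def random_rr_set(adj_rev, nodes, p=0.1):
    """One reverse-reachable (RR) set under the independent-cascade model:
    start at a uniformly random node and walk edges backwards, keeping
    each incoming edge independently with probability p."""
    root = random.choice(nodes)
    rr, stack = {root}, [root]
    while stack:
        u = stack.pop()
        for v in adj_rev.get(u, []):
            if v not in rr and random.random() < p:
                rr.add(v)
                stack.append(v)
    return rr

def ris_seed_set(adj_rev, nodes, k, num_samples=10_000, p=0.1):
    """Greedy max-coverage over RR sets: repeatedly add the node that
    appears in the most not-yet-covered RR sets. The covered fraction
    times len(nodes) is the usual estimate of the seeds' influence."""
    rr_sets = [random_rr_set(adj_rev, nodes, p) for _ in range(num_samples)]
    covered = [False] * len(rr_sets)
    seeds = []
    for _ in range(k):
        counts = {}
        for i, rr in enumerate(rr_sets):
            if not covered[i]:
                for v in rr:
                    counts[v] = counts.get(v, 0) + 1
        if not counts:
            break                        # every RR set is already covered
        best = max(counts, key=counts.get)
        seeds.append(best)
        for i, rr in enumerate(rr_sets):
            if best in rr:
                covered[i] = True
    return seeds
```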
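For group closeness, monotonicity and submodularity make the plain greedy algorithm a natural baseline: repeatedly add the node with the largest marginal gain. The sketch below runs a BFS per candidate and uses a harmonic-style group-closeness objective, the sum over all nodes v of 1 / (1 + dist(v, S)), as one common variant; this is an assumption, and the thesis's sampling-based approach exists precisely to avoid this much shortest-path computation.

```python
from collections import deque

def bfs_distances(adj, source):
    """Unweighted shortest-path distances from `source` via BFS."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def greedy_group_closeness(adj, k):
    """Greedy selection for a monotone submodular group-closeness
    objective: sum over all nodes v of 1 / (1 + dist(v, S))."""
    inf = float("inf")
    dist_to_set = {v: inf for v in adj}   # min distance to current seeds
    seeds = []
    for _ in range(k):
        best_gain, best_node, best_du = -1.0, None, None
        for u in adj:
            if u in seeds:
                continue
            du = bfs_distances(adj, u)    # a sampling-based method would
            gain = sum(                   # avoid a BFS per candidate
                1.0 / (1 + min(dist_to_set[v], du.get(v, inf)))
                - 1.0 / (1 + dist_to_set[v])
                for v in adj
            )
            if gain > best_gain:
                best_gain, best_node, best_du = gain, u, du
        if best_node is None:
            break
        seeds.append(best_node)
        for v in adj:
            dist_to_set[v] = min(dist_to_set[v], best_du.get(v, inf))
    return seeds
```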
Author(s)
Chen, Chen
Supervisor(s)
Wang, Wei
Publication Year
2016
Resource Type
Thesis
Degree Type
PhD Doctorate
Files
public version.pdf (2 MB, Adobe Portable Document Format)