Process big data using approximation methods

Access & Terms of Use
open access
Copyright: Chen, Chen
Abstract
Data proliferation makes big data analysis a challenging task. One way to address the issue is to use parallel systems, but this is costly. Meanwhile, it has been shown that approximation methods with probabilistic guarantees are sufficient in many real applications. In this thesis, we apply sampling and sketching techniques to obtain high-quality approximate results for four important queries.

Firstly, we study the gapped set intersection size estimation problem. Given two integer sets and a gap parameter δ, two elements are deemed a match if their numeric difference equals δ or is at most δ. We first distinguish two subtypes of the problem: point gap estimation (the difference equals δ) and range gap estimation (the difference is at most δ). We then propose optimized sketches that tackle the two problems efficiently and effectively with theoretical guarantees, and we demonstrate the use of the proposed techniques in efficiently mining top-K related keywords.

Secondly, in response to the growing demand for fast data visualization, we study the order-preserving estimation problem, which retains important characteristics of the data. We focus on the population mean as the primary estimation function. By dynamically allocating the failure probability, we propose two effective query processing strategies that keep the order of the estimates correct with probabilistic guarantees.

Finally, motivated by the demand for analyzing large graphs, we investigate two related problems in social networks: influence maximization and closeness centrality. To accelerate influence maximization, we bring the order of samples into the RIS (reverse influence sampling) framework and derive early-stop conditions that shorten the seed set selection procedure. Furthermore, we provide a cost-effective method for choosing a sample size that bounds the quality of the returned seed set. For closeness centrality, we aim to find a set of k nodes that has the largest closeness centrality as a whole. We show that this problem is NP-hard and prove that its objective function is monotone and submodular. To handle large graphs, we propose a sampling-based approach that reduces the cost of computing shortest-path distances from the sampled nodes to the other nodes by selecting the sampled nodes incrementally. Additional optimizations further reduce this cost.
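The gapped intersection sketches build on standard set-intersection summaries. As a rough illustration of the underlying idea (not the thesis's optimized sketches), the following Python sketch estimates an intersection size with a k-minimum-values (KMV) summary and shows how a point-gap query reduces to a plain intersection by shifting one set; the hash function, the sketch size k, and all helper names are illustrative assumptions.

```python
import hashlib

def kmv_sketch(items, k=64):
    """KMV sketch: the k smallest normalized hash values of a set."""
    hashes = sorted(
        int(hashlib.md5(str(x).encode()).hexdigest(), 16) / 2.0**128
        for x in set(items)
    )
    return hashes[:k]

def estimate_intersection(sketch_a, sketch_b, k=64):
    """Classic KMV estimate of |A ∩ B| from two sketches."""
    merged = sorted(set(sketch_a) | set(sketch_b))[:k]
    in_both = set(sketch_a) & set(sketch_b) & set(merged)
    if len(merged) < k:                  # whole union fits in the sketch,
        return float(len(in_both))       # so the count is exact
    tau = merged[-1]                     # k-th smallest hash of the union
    union_est = (k - 1) / tau            # standard KMV union-size estimate
    return len(in_both) / k * union_est  # scale by the estimated Jaccard

def estimate_point_gap(a, b, delta, k=64):
    """Point-gap matches (y - x == delta) reduce to a plain intersection
    after shifting every element of one set by delta."""
    shifted = [y - delta for y in b]
    return estimate_intersection(kmv_sketch(a, k), kmv_sketch(shifted, k), k)
```

A range-gap query (difference at most δ) can in principle be reduced to a combination of such shifted intersections; handling that combination without one sketch per shift is where optimized sketches are needed.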
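The order-preserving estimation strategy can be pictured as incremental sampling with confidence intervals: keep drawing samples until the intervals around the two means separate, at which point their order is settled with high probability. The sketch below uses Hoeffding bounds and a fixed, even split of the failure probability across the two intervals (a simple union bound); the thesis's dynamic allocation of the failure probability refines this baseline, and all names and parameters here are assumptions.

```python
import math, random

def hoeffding_half_width(n, delta, value_range=1.0):
    """Hoeffding half-width: the sample mean of n bounded draws is within
    this distance of the true mean with probability >= 1 - delta."""
    return value_range * math.sqrt(math.log(2.0 / delta) / (2.0 * n))

def order_two_means(pop_a, pop_b, delta=0.05, value_range=1.0,
                    batch=200, max_n=100_000):
    """Sample both populations until the confidence intervals separate;
    the reported order of the means then holds w.p. >= 1 - delta."""
    sample_a, sample_b = [], []
    while len(sample_a) < max_n:
        sample_a += random.choices(pop_a, k=batch)   # with replacement
        sample_b += random.choices(pop_b, k=batch)
        mean_a = sum(sample_a) / len(sample_a)
        mean_b = sum(sample_b) / len(sample_b)
        eps_a = hoeffding_half_width(len(sample_a), delta / 2, value_range)
        eps_b = hoeffding_half_width(len(sample_b), delta / 2, value_range)
        if mean_a + eps_a < mean_b - eps_b:
            return "mean(A) < mean(B)"
        if mean_b + eps_b < mean_a - eps_a:
            return "mean(B) < mean(A)"
    return "undecided"   # means too close to separate within the budget
```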
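For influence maximization, the RIS framework estimates a node set's influence by how many random reverse-reachable (RR) sets it covers. A minimal sketch under the independent-cascade model, assuming a fixed edge probability p and a fixed number of samples (whereas the thesis derives early-stop conditions and a cost-effective sample size instead), might look like this; adj_rev, p, and num_samples are illustrative assumptions.

```python
import random

def random_rr_set(adj_rev, nodes, p=0.1):
    """One reverse-reachable (RR) set under the independent-cascade model:
    start at a uniformly random node and walk edges backwards, keeping
    each incoming edge independently with probability p."""
    root = random.choice(nodes)
    rr, stack = {root}, [root]
    while stack:
        u = stack.pop()
        for v in adj_rev.get(u, []):
            if v not in rr and random.random() < p:
                rr.add(v)
                stack.append(v)
    return rr

def ris_seed_set(adj_rev, nodes, k, num_samples=10_000, p=0.1):
    """Greedy max-coverage over RR sets: repeatedly add the node that
    appears in the most not-yet-covered RR sets. The covered fraction
    times len(nodes) is the usual estimate of the seeds' influence."""
    rr_sets = [random_rr_set(adj_rev, nodes, p) for _ in range(num_samples)]
    covered = [False] * len(rr_sets)
    seeds = []
    for _ in range(k):
        counts = {}
        for i, rr in enumerate(rr_sets):
            if not covered[i]:
                for v in rr:
                    counts[v] = counts.get(v, 0) + 1
        if not counts:
            break                        # every RR set is already covered
        best = max(counts, key=counts.get)
        seeds.append(best)
        for i, rr in enumerate(rr_sets):
            if best in rr:
                covered[i] = True
    return seeds
```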
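For group closeness, monotonicity and submodularity make the plain greedy algorithm a natural baseline: repeatedly add the node with the largest marginal gain. The sketch below runs a BFS per candidate and uses a harmonic-style group-closeness objective, the sum over all nodes v of 1 / (1 + dist(v, S)), as one common variant; this is an assumption, and the thesis's sampling-based approach exists precisely to avoid this much shortest-path computation.

```python
from collections import deque

def bfs_distances(adj, source):
    """Unweighted shortest-path distances from `source` via BFS."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def greedy_group_closeness(adj, k):
    """Greedy selection for a monotone submodular group-closeness
    objective: sum over all nodes v of 1 / (1 + dist(v, S))."""
    inf = float("inf")
    dist_to_set = {v: inf for v in adj}   # min distance to current seeds
    seeds = []
    for _ in range(k):
        best_gain, best_node, best_du = -1.0, None, None
        for u in adj:
            if u in seeds:
                continue
            du = bfs_distances(adj, u)    # a sampling-based method would
            gain = sum(                   # avoid a BFS per candidate
                1.0 / (1 + min(dist_to_set[v], du.get(v, inf)))
                - 1.0 / (1 + dist_to_set[v])
                for v in adj
            )
            if gain > best_gain:
                best_gain, best_node, best_du = gain, u, du
        if best_node is None:
            break
        seeds.append(best_node)
        for v in adj:
            dist_to_set[v] = min(dist_to_set[v], best_du.get(v, inf))
    return seeds
```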
Author(s)
Chen, Chen
Supervisor(s)
Wang, Wei
Publication Year
2016
Resource Type
Thesis
Degree Type
PhD Doctorate
Files
public version.pdf (2 MB, Adobe Portable Document Format)