Efficient structure search in large data sets

Bi, Fei

doi:10.26190/unsworks/20432

Efficient structure search in large data sets

Download files

Access & Terms of Use

open access
Copyright: Bi, Fei

CC BY-NC-ND 3.0

Abstract

Nowadays, with the explosion of information, there has been rapid increase in the amount of data being generated and published; therefore there is an emerging need for efficiently managing and processing large data sets. Among the existing problems, structure search is an important topic with many real applications, such as substructure search in social networks and protein interaction networks and community detecting in social/collaboration networks. In this thesis, we study three important problems regarding structure search in large data sets. Firstly, we study the structure search in graph data and focus on subgraph matching over large data graphs, which extracts all subgraph isomorphic embeddings of a query graph q in a large data graph G. For the first time we address the issue of unpromising results by Cartesian products from "dissimilar" vertices. We propose a new framework by postponing the Cartesian products based on the structure of a query to minimize the redundant Cartesian products. We also devise a new linear size auxiliary data structure to generate a matching order and conduct subgraph matching. Secondly, we investigate the "community" structure search in large graphs, and emphasize on top-k influential communities, where each reported community not only is a cohesive subgraph but also has a high influence value. We propose an instance-optimal algorithm LocalSearch whose time complexity is linearly proportional to the size of the smallest subgraph that a correct algorithm needs to access without indexes. In addition, we propose techniques to make LocalSearch progressively compute and report the communities in decreasing influence value order such that k does not need to be specified. Finally, we study the structure search problem in string data sets with edit distance similarity, which retrieves all strings in a string data set that are similar to a query string with respect to a given edit distance threshold. We propose a cross pivotal based approach to fully exploiting the pruning power of multiple pivotal sets, which is a set of non-overlapping signatures for indexing and query processing. We prove theoretically that our cross pivotal filter has stronger pruning power than state-of-the-art filters.

Persistent link to this record

http://hdl.handle.net/1959.4/60228

DOI

https://doi.org/10.26190/unsworks/20432

Author(s)

Bi, Fei

Supervisor(s)

Zhang, Wenjie

Lin, Xuemin

Publication Year

2018

Resource Type

Thesis

Degree Type

PhD Doctorate

UNSW Faculty

Files

public version.pdf

1.82 MB

Adobe Portable Document Format

View full record Show statistics

Library

Efficient structure search in large data sets

Access & Terms of Use

Altmetric

Abstract

Persistent link to this record

DOI

Link to Publisher Version

Link to Open Access Version

Additional Link

Author(s)

Supervisor(s)

Creator(s)

Editor(s)

Translator(s)

Curator(s)

Designer(s)

Arranger(s)

Composer(s)

Recordist(s)

Conference Proceedings Editor(s)

Other Contributor(s)

Corporate/Industry Contributor(s)

Publication Year

Resource Type

Degree Type

UNSW Faculty

Files

Related dataset(s)