Efficient structure search in large data sets

Download files
Access & Terms of Use
open access
Copyright: Bi, Fei
Altmetric
Abstract
Nowadays, with the explosion of information, there has been rapid increase in the amount of data being generated and published; therefore there is an emerging need for efficiently managing and processing large data sets. Among the existing problems, structure search is an important topic with many real applications, such as substructure search in social networks and protein interaction networks and community detecting in social/collaboration networks. In this thesis, we study three important problems regarding structure search in large data sets. Firstly, we study the structure search in graph data and focus on subgraph matching over large data graphs, which extracts all subgraph isomorphic embeddings of a query graph q in a large data graph G. For the first time we address the issue of unpromising results by Cartesian products from "dissimilar" vertices. We propose a new framework by postponing the Cartesian products based on the structure of a query to minimize the redundant Cartesian products. We also devise a new linear size auxiliary data structure to generate a matching order and conduct subgraph matching. Secondly, we investigate the "community" structure search in large graphs, and emphasize on top-k influential communities, where each reported community not only is a cohesive subgraph but also has a high influence value. We propose an instance-optimal algorithm LocalSearch whose time complexity is linearly proportional to the size of the smallest subgraph that a correct algorithm needs to access without indexes. In addition, we propose techniques to make LocalSearch progressively compute and report the communities in decreasing influence value order such that k does not need to be specified. Finally, we study the structure search problem in string data sets with edit distance similarity, which retrieves all strings in a string data set that are similar to a query string with respect to a given edit distance threshold. We propose a cross pivotal based approach to fully exploiting the pruning power of multiple pivotal sets, which is a set of non-overlapping signatures for indexing and query processing. We prove theoretically that our cross pivotal filter has stronger pruning power than state-of-the-art filters.
Persistent link to this record
Link to Publisher Version
Link to Open Access Version
Additional Link
Author(s)
Bi, Fei
Supervisor(s)
Zhang, Wenjie
Lin, Xuemin
Creator(s)
Editor(s)
Translator(s)
Curator(s)
Designer(s)
Arranger(s)
Composer(s)
Recordist(s)
Conference Proceedings Editor(s)
Other Contributor(s)
Corporate/Industry Contributor(s)
Publication Year
2018
Resource Type
Thesis
Degree Type
PhD Doctorate
UNSW Faculty
Files
download public version.pdf 1.82 MB Adobe Portable Document Format
Related dataset(s)