Abstract
Recent decades witnessed a rapid proliferation of graph data, such as chemical
structures and business processes. Structural queries are frequently issued in these
domains, and hence, attract extensive attention. Due to data inconsistency and
noise, a recent trend is to study similarity queries. Moreover, graphs may possess
multiple attributes on vertices, imposing additional challenges. Hence, indexing
and processing methods for answering advanced structural queries are strongly in
demand.
The thesis studies three types of emerging structural queries in large-scale graph
databases, including graph similarity join, graph similarity search and graph substructure
selection.
Graph similarity join retrieves all similarity pairs from two graph collections
such that their similarity satisfies a given threshold. Particularly, we incorporate
graph edit distance as similarity measure. Inspired by the q-gram idea for string
similarity queries, we introduce path-based q-grams on graphs, and establish a
count filtering lower bound for generating candidates. An efficient algorithm is
proposed to exploit both matching and mismatching features. We also present
label and degree-based matching conditions to strengthen the filtering techniques.
Leveraging offline index, we investigate efficient graph similarity search. Existing
solutions employ fixed-size overlapping substructures, and thus, are susceptible to large vertex degrees and thresholds. Disparately, we present a partition-based
approach by dividing graphs into variable-size non-overlapping partitions. The edit
distance constraint is converted to a containment constraint for candidate generation.
A dynamic pruning technique and an improved verification algorithm are
also developed. In addition, a cost-aware graph partitioning technique is conceived
to optimize the index.
Equipping structural queries with rich semantics, substructure selection over
multi-attributed graphs finds all subgraph isomorphic mappings such that each pair
of matching vertices satisfy a set of selection conditions, each against an equality,
range, or set containment operator on a vertex attribute. Existing techniques
for single-labeled graphs are developed under the assumption of identical label
matching, and thus, cannot handle the general cases. To this end, we propose
a two-tier index via judiciously materializing feature mappings. We also devise
a scalable indexing and query processing method exploiting execution plans with
minimum cost.