Abstract
With the prevalence of social media and GPS-enabled devices, a massive amount of geo-textual data is being generated continuously as a stream. In this thesis, we study the problem of efficiently processing streaming geo-textual data in publish/subscribe systems (pub/sub for short), which has broad applications in location-based advertising and information dissemination. In a spatial-keyword pub/sub system, users register their interests as spatial-keyword subscriptions (e.g., interest in nearby restaurant discounts); a stream of geo-textual messages (e.g., geo-tagged e-coupons) released by publishers is then continuously delivered to the relevant subscriptions. We study three important aspects of spatial-keyword pub/sub systems.
First, we investigate Boolean-based spatial-keyword pub/sub, where a message is delivered to a subscription if it contains all of the subscription's keywords and falls inside the subscription's range. We handle both stationary and moving subscriptions by proposing a novel adaptive indexing structure, which significantly reduces the processing time of incoming messages.
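The Boolean matching condition can be illustrated with a minimal sketch. It shows only the delivery predicate, checked by a linear scan, and not the adaptive index proposed in the thesis. The class names, the function names `matches` and `deliver`, and the choice of an axis-aligned rectangle as the subscription range are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Subscription:
    keywords: frozenset
    # Assumed range shape: axis-aligned rectangle (min_x, min_y, max_x, max_y).
    region: tuple


@dataclass(frozen=True)
class Message:
    keywords: frozenset
    x: float
    y: float


def matches(msg, sub):
    """True if msg lies inside sub's range and contains all of sub's keywords."""
    min_x, min_y, max_x, max_y = sub.region
    inside = min_x <= msg.x <= max_x and min_y <= msg.y <= max_y
    return inside and sub.keywords <= msg.keywords


def deliver(msg, subs):
    """Linear-scan baseline: check every subscription against the message."""
    return [s for s in subs if matches(msg, s)]
```

A linear scan costs one predicate check per registered subscription for every incoming message, which is the cost an index over subscriptions aims to avoid.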
Second, we study ranking-based spatial-keyword pub/sub, where we continuously maintain the top-k most relevant messages for every subscription over a sliding window. We propose a novel index that integrates both spatial-based and keyword-based pruning rules to support efficient message dissemination. We further develop a cost-based re-evaluation technique to reduce the number of re-evaluations. This is the first work to investigate spatial-keyword pub/sub over a sliding window.
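To make the re-evaluation problem concrete, the following is a naive baseline sketch for a single subscription, not the index or the cost-based technique of the thesis. It assumes a count-based sliding window and a hypothetical relevance function `score` that mixes spatial proximity and keyword overlap; the thesis's actual scoring function and window model may differ. The class name `NaiveTopK` and the parameters `alpha` and `max_dist` are likewise illustrative.

```python
import heapq
import math
from collections import deque


def score(msg, sub, alpha=0.5, max_dist=100.0):
    """Hypothetical relevance: weighted mix of proximity and keyword overlap."""
    dist = math.hypot(msg["x"] - sub["x"], msg["y"] - sub["y"])
    proximity = max(0.0, 1.0 - dist / max_dist)
    overlap = len(msg["keywords"] & sub["keywords"]) / len(sub["keywords"])
    return alpha * proximity + (1 - alpha) * overlap


class NaiveTopK:
    """Top-k messages for one subscription over the last window_size messages."""

    def __init__(self, sub, k, window_size):
        self.sub, self.k, self.window_size = sub, k, window_size
        self.window = deque()
        self.topk = []            # list of (score, message id), best first
        self.reevaluations = 0    # number of full window rescans

    def _recompute(self):
        # Full rescan of the window: the expensive step to be minimised.
        self.reevaluations += 1
        scored = [(score(m, self.sub), m["id"]) for m in self.window]
        self.topk = heapq.nlargest(self.k, scored)

    def add(self, msg):
        self.window.append(msg)
        expired = None
        if len(self.window) > self.window_size:
            expired = self.window.popleft()
        if expired is not None and any(mid == expired["id"] for _, mid in self.topk):
            # A current result left the window, so the top-k must be rebuilt.
            self._recompute()
        else:
            entry = (score(msg, self.sub), msg["id"])
            self.topk = heapq.nlargest(self.k, self.topk + [entry])
```

In this baseline, every expiry of a current top-k message triggers a scan of the whole window, which is why reducing the number of re-evaluations matters.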
Finally, we investigate distributed stream processing, where a continuous data stream is processed in a distributed manner. We first study distributed stream similarity join over textual data. We develop a novel length-based distribution framework that dispatches incoming data by the number of tokens each object contains, which incurs no data replication and little communication cost while achieving high throughput. We also design a bundle-based local index that facilitates the local join by grouping similar objects. We then consider geo-textual data by extending ranking-based spatial-keyword pub/sub to a distributed environment. Efficient distribution mechanisms are developed to achieve load balance and high throughput. This is the first work that systematically studies ranking-based spatial-keyword pub/sub in a distributed stream environment.
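The idea of dispatching by token count can be sketched as follows, under stated assumptions: the similarity function is Jaccard, records are non-empty token sets, and each worker owns a contiguous range of lengths. The sketch relies on the standard length filter for Jaccard similarity: if J(x, y) >= t, then t * |x| <= |y| <= |x| / t. The names `LengthRouter`, `probe_targets`, and `join_stream` are hypothetical, workers are simulated as in-memory lists, and the thesis's actual framework and its bundle-based local index may differ.

```python
import math
from bisect import bisect_right


def jaccard(a, b):
    return len(a & b) / len(a | b)


def length_bounds(n, threshold, eps=1e-9):
    """Lengths a set may have and still reach the Jaccard threshold with a set of size n."""
    # eps guards against floating-point error tightening the bounds.
    return math.ceil(threshold * n - eps), math.floor(n / threshold + eps)


class LengthRouter:
    def __init__(self, boundaries):
        # boundaries [5, 10]: worker 0 owns lengths < 5, worker 1 owns 5..9,
        # worker 2 owns lengths >= 10.
        self.boundaries = boundaries

    def owner(self, n):
        """The single worker that stores records of length n (no replication)."""
        return bisect_right(self.boundaries, n)

    def probe_targets(self, n, threshold):
        """Workers whose length range can contain a join partner."""
        lo, hi = length_bounds(n, threshold)
        return list(range(self.owner(lo), self.owner(hi) + 1))


def join_stream(records, boundaries, threshold):
    """Join each arriving record against the earlier records in the stream."""
    router = LengthRouter(boundaries)
    stores = [[] for _ in range(len(boundaries) + 1)]  # one list per worker
    results = []
    for rid, tokens in records:
        for w in router.probe_targets(len(tokens), threshold):
            for other_id, other in stores[w]:
                if jaccard(tokens, other) >= threshold:
                    results.append((other_id, rid))
        stores[router.owner(len(tokens))].append((rid, tokens))
    return results
```

In this sketch each record is stored on exactly one worker, and only the probe is sent to the few workers whose length ranges overlap the filter interval.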