The stuff that streams are made of: Streaming models for concurrent execution of multiple queries

Robinson, Amos

doi:10.26190/unsworks/21126

The stuff that streams are made of: Streaming models for concurrent execution of multiple queries

Download files

Access & Terms of Use

open access
Copyright: Robinson, Amos

CC BY-NC-ND 3.0

Abstract

To learn from a large dataset, we generally want to perform lots of queries. If we perform each query separately, we may spend more time reading and re-reading the same dataset than we spend computing the answer. Instead of performing each query separately, we would like to amortise the cost of reading the data by performing multiple queries at the same time. Two streaming models for executing multiple queries concurrently are push streams and Kahn process networks. Push streams can be used to execute multiple queries concurrently, but push streams can be unwieldy to use as queries must be constructed ``back-to-front''. We introduce a query language called Icicle, which allows programmers to write and reason about queries using a more familiar array-based semantics, while retaining the execution strategy of push streams. The type system of Icicle guarantees that well-typed query programs have the same semantics whether they are executed as array programs or as stream programs, and that all queries over the same input data can be executed together. However, push streams do not support computations with multiple inputs except for non-deterministically merging two streams. Kahn process networks support multiple inputs and multiple queries, but require dynamic scheduling and inter-process communication, both of which can introduce significant overhead. We introduce a method for taking multiple processes in a Kahn process network and fusing them together into a single process. The fused process communicates through local variables rather than costly communication channels. This fusion method generalises previous work on stream fusion and demonstrates the connection between fusion and the synchronised product operator, which is generally used in the context of verification and model checking, rather than as an optimisation. Some queries must be executed in multiple passes, as they need to read the input data multiple times, or may produce intermediate outputs which are then read back in. For such queries, there are usually many different ways to schedule the work among the separate passes. Prior work demonstrated how integer linear programming (ILP) can be used to find optimal schedules for imperative array programs. However, these approaches can only handle operations that preserve the size of the array, missing out on some optimisation opportunities. We introduce a clustering algorithm for scheduling queries written using array combinators, and extend prior work to support size-changing operations.

Persistent link to this record

http://hdl.handle.net/1959.4/61695

DOI

https://doi.org/10.26190/unsworks/21126

Author(s)

Robinson, Amos

Supervisor(s)

Lippmeier, Ben

Keller, Gabriele

Chakravarty, Manuel

Publication Year

2019

Resource Type

Thesis

Degree Type

PhD Doctorate

UNSW Faculty

Files

public version.pdf

1.45 MB

Adobe Portable Document Format

View full record Show statistics

Library

The stuff that streams are made of: Streaming models for concurrent execution of multiple queries

Access & Terms of Use

Altmetric

Abstract

Persistent link to this record

DOI

Link to Publisher Version

Link to Open Access Version

Additional Link

Author(s)

Supervisor(s)

Creator(s)

Editor(s)

Translator(s)

Curator(s)

Designer(s)

Arranger(s)

Composer(s)

Recordist(s)

Conference Proceedings Editor(s)

Other Contributor(s)

Corporate/Industry Contributor(s)

Publication Year

Resource Type

Degree Type

UNSW Faculty

Files

Related dataset(s)