Automatic Parallelization of Tiled Stencil Loop Nests on GPUs

Di, Peng

doi:10.26190/unsworks/16812

Publication:

Automatic Parallelization of Tiled Stencil Loop Nests on GPUs

dc.contributor.advisor	Xue, Jingling	en_US
dc.contributor.author	Di, Peng	en_US
dc.date.accessioned	2022-03-21T14:01:16Z
dc.date.available	2022-03-21T14:01:16Z
dc.date.issued	2013	en_US
dc.description.abstract	This thesis attempts to design and implement a compiler framework based on the polyhedral model. The compiler automatically parallelizes loop nests; especially stencil kernels, into efficient GPU code by loop tiling transformations which the polyhedral model describes. To enhance parallel performance, we introduce three practically efficient techniques to process different types of loop nests. The experimental results of our compiler framework have demonstrated that these advanced techniques can outperform previous approaches. Firstly, we aim to find efficient tiling transformations without violating data dependences. How to select a tile's shape and size is an open issue that is performance-critical and influenced by GPU's hardware constraints. We propose an approach to determine the tile shapes out of consideration for improving two-level parallelism of GPUs. The new approach finds appropriate tiling hyperplanes by embedding parallelism-enhancing constraints into the polyhedral model to maximize intra-tile, i.e., intra-SM parallelism. This improves the load balance among the streaming processors (SPs), which execute a wavefront of loop iterations within a tile. We eliminate parallelism-hindering false dependences to optimize inter-tile, i.e., inter-SM parallelism. This improves the load balance among the streaming multiprocessors (SMs), which execute a wavefront of tiles. Furthermore, to avoid combinatorial explosion of tile size's configurations, we present a model-driven approach to automating tile size selection that is performance-critical for loop tiling transformations, especially for DOACROSS loop nests. Our tile size selection model accurately estimates the execution times of tiled loop nests running on GPUs. The selected tile sizes lead to the performance results that are close to the best observed for a range of problem sizes tested. Finally, to address the difficulty and low-performance of parallelizing widely used SOR stencil loop nests, we present a new tiled parallel SOR method, called MLSOR, which admits more efficient data-parallel SIMD execution on GPUs. Unlike the previous two approaches that are dependence-preserving, the basic idea is to algorithmically restructure a stencil kernel based on a non-dependence-preserving parallelization scheme to avoid pipelining for higher parallelism. The new approach can be implemented in compilers through a pattern matching pass to optimize SOR-like DOACROSS loop nests on GPUs.	en_US
dc.identifier.uri	http://hdl.handle.net/1959.4/53495
dc.language	English
dc.language.iso	EN	en_US
dc.publisher	UNSW, Sydney	en_US
dc.rights	CC BY-NC-ND 3.0	en_US
dc.rights.uri	https://creativecommons.org/licenses/by-nc-nd/3.0/au/	en_US
dc.subject.other	Compiler	en_US
dc.subject.other	GPU	en_US
dc.subject.other	Loop tiling	en_US
dc.subject.other	Parallel computing	en_US
dc.title	Automatic Parallelization of Tiled Stencil Loop Nests on GPUs	en_US
dc.type	Thesis	en_US
dcterms.accessRights	open access
dcterms.rightsHolder	Di, Peng
dspace.entity.type	Publication	en_US
unsw.accessRights.uri	https://purl.org/coar/access_right/c_abf2
unsw.identifier.doi	https://doi.org/10.26190/unsworks/16812
unsw.relation.faculty	Engineering
unsw.relation.originalPublicationAffiliation	Di, Peng, Computer Science & Engineering, Faculty of Engineering, UNSW	en_US
unsw.relation.originalPublicationAffiliation	Xue, Jingling, Computer Science & Engineering, Faculty of Engineering, UNSW	en_US
unsw.relation.school	School of Computer Science and Engineering	*
unsw.thesis.degreetype	PhD Doctorate	en_US

Files

Original bundle

Now showing 1 - 1 of 1

Name:: public version.pdf
Size:: 2.69 MB
Format:: application/pdf
Description:

Download

Resource type

Thesis

Publication: Automatic Parallelization of Tiled Stencil Loop Nests on GPUs

Files

Original bundle

Resource type

Publication:

Automatic Parallelization of Tiled Stencil Loop Nests on GPUs