Compiler techniques for improving SIMD parallelism

Access & Terms of Use
open access
Copyright: Zhou, Hao
Abstract
Modern CPUs are equipped with Single Instruction Multiple Data (SIMD) engines operating on short vectors, in order to meet the growing demand for accelerating multimedia and scientific applications. Automatic simdization performed by the compiler is a convenient way of utilizing these SIMD ISA extensions, but it faces the challenge of translating SIMD resources into actual performance. Not only is discovering SIMD parallelism sometimes difficult due to imprecise dependence analysis, but even for SIMD parallelism that can be identified, current vectorization techniques are often ineffective. In this thesis, we examine mixed SIMD parallelism and partial SIMD parallelism, both of which are poorly exploited by existing approaches. Two new vectorization techniques, Loop-Mix and Paver, are proposed to tackle them, respectively.

Existing loop vectorization techniques exploit either intra-iteration or inter-iteration SIMD parallelism alone in a code region if the part of the region vectorized for one type of parallelism has data dependences (called mixed-parallelism-inhibiting dependences) on the part vectorized for the other type. We consider a class of loops that exhibit both types of parallelism in code regions containing mixed-parallelism-inhibiting data dependences (i.e., mixed SIMD parallelism). We present a new compiler approach, Loop-Mix, that exploits such mixed SIMD parallelism effectively by reducing the data reorganization overhead incurred when switching from one type of parallelism to the other.

Additionally, existing loop vectorization techniques are ineffective for loops that exhibit little loop-level parallelism but some limited superword-level parallelism (SLP), where the SIMD parallelism is insufficient to fill the SIMD datapath (i.e., partial SIMD parallelism). We show that effectively vectorizing such loops requires partial vector operations to be executed correctly and efficiently. We present a simple yet effective SLP compiler technique, called Paver (PArtial VEctorizeR), formulated as a generalization of the traditional SLP algorithm, to optimize such partially vectorizable loops. The key idea is to maximize SIMD utilization by widening the vector instructions used, while minimizing the overheads caused by memory access, packing/unpacking, and/or masking operations, without introducing new memory errors or numeric exceptions. Both Loop-Mix and Paver are simple and have been implemented in LLVM. We evaluate them on several real-world programs containing mixed or partial SIMD parallelism and demonstrate their performance advantages over the state-of-the-art.
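As an illustration of the two kinds of parallelism the abstract targets, the C sketch below shows minimal loop shapes of each kind. It is not taken from the thesis; the function names, array layouts, and the assumption of a 4-lane SIMD datapath are hypothetical and chosen only to make the two cases concrete.

#include <stddef.h>

/* Mixed SIMD parallelism (Loop-Mix's target): the horizontal sum over
 * b[i][0..3] favours intra-iteration vectorization, while the updates of a[i]
 * across iterations favour inter-iteration vectorization; the dependence of
 * a[i] on s is a mixed-parallelism-inhibiting dependence, so switching between
 * the two styles normally incurs data reorganization overhead. */
void mixed(float *a, const float (*b)[4], size_t n) {
    for (size_t i = 0; i < n; i++) {
        float s = b[i][0] + b[i][1] + b[i][2] + b[i][3]; /* intra-iteration parallelism */
        a[i] *= s;                                       /* inter-iteration parallelism */
    }
}

/* Partial SIMD parallelism (Paver's target): the loop-carried dependence on
 * x[] blocks loop-level vectorization, and only three isomorphic statements
 * exist per iteration, so a 4-lane SIMD datapath cannot be filled; vectorizing
 * them requires a partial (e.g. masked) vector operation so that the unused
 * lane neither overwrites live data nor accesses memory past the end of x[]. */
void partial(float *x, const float *y, size_t n) {
    for (size_t i = 1; i < n; i++) {
        x[3*i + 0] = x[3*(i-1) + 0] + y[3*i + 0];
        x[3*i + 1] = x[3*(i-1) + 1] + y[3*i + 1];
        x[3*i + 2] = x[3*(i-1) + 2] + y[3*i + 2];
    }
}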
Author(s)
Zhou, Hao
Supervisor(s)
Xue, Jingling
Publication Year
2016
Resource Type
Thesis
Degree Type
PhD Doctorate
Files
public version.pdf (5.28 MB, Adobe Portable Document Format)