TY - GEN
T1 - Lightweight dependency checking for parallelizing loops with non-deterministic dependency on GPU
AU - Liu, Hongyuan
AU - Lam, King Tin
AU - Lin, Huanxin
AU - Wang, Cho Li
AU - Ma, Junchao
N1 - Publisher Copyright: © 2016 IEEE.
PY - 2016/7/2
Y1 - 2016/7/2
N2 - General-purpose GPUs have been prevalent for a decade. Nevertheless, GPU programming remains an onerous job practically exclusive to veteran developers who must know both domain-specific knowledge and GPU architecture well. Although current parallelizing compilers that automatically parallelize and offload sizable loops onto the GPU have helped in unfettering the power of the GPU with minimal programming effort, there is still a family of loops that carry statically non-deterministic data dependencies and cannot be parallelized. To tackle this issue, we propose two lightweight dependency checking schemes that are very different from existing conservative compilers to assist in parallelizing loops with non-deterministic data dependencies. Our schemes feature linear work complexity for memory operations, lower memory consumption compared to previous work, and minimal false positives by leveraging the lockstep execution on the GPU's SIMD lanes. Experiments done using microbenchmarking and real-life applications on the latest advanced AMD discrete GPUs show that our schemes can achieve a 2.2× speedup over existing solutions in dependency-free cases while taking only about 20% of the time of existing solutions in the case with statically unproven loop-carried dependencies.
AB - General-purpose GPUs have been prevalent for a decade. Nevertheless, GPU programming remains an onerous job practically exclusive to veteran developers who must know both domain-specific knowledge and GPU architecture well. Although current parallelizing compilers that automatically parallelize and offload sizable loops onto the GPU have helped in unfettering the power of the GPU with minimal programming effort, there is still a family of loops that carry statically non-deterministic data dependencies and cannot be parallelized. To tackle this issue, we propose two lightweight dependency checking schemes that are very different from existing conservative compilers to assist in parallelizing loops with non-deterministic data dependencies. Our schemes feature linear work complexity for memory operations, lower memory consumption compared to previous work, and minimal false positives by leveraging the lockstep execution on the GPU's SIMD lanes. Experiments done using microbenchmarking and real-life applications on the latest advanced AMD discrete GPUs show that our schemes can achieve a 2.2× speedup over existing solutions in dependency-free cases while taking only about 20% of the time of existing solutions in the case with statically unproven loop-carried dependencies.
KW - Code Generation
KW - Dependency Checking
KW - GPGPU
KW - Loop Parallelization
UR - http://www.scopus.com/inward/record.url?scp=85017674902&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85017674902&partnerID=8YFLogxK
U2 - 10.1109/ICPADS.2016.0119
DO - 10.1109/ICPADS.2016.0119
M3 - Conference contribution
AN - SCOPUS:85017674902
T3 - Proceedings of the International Conference on Parallel and Distributed Systems - ICPADS
SP - 884
EP - 893
BT - Proceedings - 22nd IEEE International Conference on Parallel and Distributed Systems, ICPADS 2016
A2 - Liao, Xiaofei
A2 - Lovas, Robert
A2 - Shen, Xipeng
A2 - Zheng, Ran
T2 - 22nd IEEE International Conference on Parallel and Distributed Systems, ICPADS 2016
Y2 - 13 December 2016 through 16 December 2016
ER -