TY - GEN
T1 - Analyzing and leveraging remote-core bandwidth for enhanced performance in GPUs
AU - Ibrahim, Mohamed Assem
AU - Liu, Hongyuan
AU - Kayiran, Onur
AU - Jog, Adwait
N1 - Publisher Copyright: © 2019 IEEE.
PY - 2019/9
Y1 - 2019/9
N2 - Bandwidth achieved from local/shared caches and memory is a major performance determinant in Graphics Processing Units (GPUs). These existing sources of bandwidth are often not enough for optimal GPU performance. Therefore, to enhance performance further, we focus on efficiently unlocking an additional potential source of bandwidth, which we call remote-core bandwidth. The source of this bandwidth is based on the observation that a fraction of data (i.e., L1 read misses) required by one GPU core can also be found in the local (L1) caches of other GPU cores. In this paper, we propose to efficiently coordinate the data movement across cores in GPUs to exploit this remote-core bandwidth. However, we find that its efficient detection and utilization present several challenges. To this end, we specifically address: a) which data is shared across cores, b) which cores have the shared data, and c) how we can get the data as soon as possible. Our extensive evaluation across a wide set of GPGPU applications shows that significant performance improvement can be achieved at a modest hardware cost on account of the additional bandwidth received from the remote cores.
AB - Bandwidth achieved from local/shared caches and memory is a major performance determinant in Graphics Processing Units (GPUs). These existing sources of bandwidth are often not enough for optimal GPU performance. Therefore, to enhance performance further, we focus on efficiently unlocking an additional potential source of bandwidth, which we call remote-core bandwidth. The source of this bandwidth is based on the observation that a fraction of data (i.e., L1 read misses) required by one GPU core can also be found in the local (L1) caches of other GPU cores. In this paper, we propose to efficiently coordinate the data movement across cores in GPUs to exploit this remote-core bandwidth. However, we find that its efficient detection and utilization present several challenges. To this end, we specifically address: a) which data is shared across cores, b) which cores have the shared data, and c) how we can get the data as soon as possible. Our extensive evaluation across a wide set of GPGPU applications shows that significant performance improvement can be achieved at a modest hardware cost on account of the additional bandwidth received from the remote cores.
KW - Bandwidth
KW - GPUs
KW - Network-on-Chip
UR - https://www.scopus.com/pages/publications/85075469576
UR - https://www.scopus.com/inward/citedby.url?scp=85075469576&partnerID=8YFLogxK
U2 - 10.1109/PACT.2019.00028
DO - 10.1109/PACT.2019.00028
M3 - Conference contribution
AN - SCOPUS:85075469576
T3 - Parallel Architectures and Compilation Techniques - Conference Proceedings, PACT
SP - 257
EP - 270
BT - Proceedings - 2019 28th International Conference on Parallel Architectures and Compilation Techniques, PACT 2019
T2 - 28th International Conference on Parallel Architectures and Compilation Techniques, PACT 2019
Y2 - 21 September 2019 through 25 September 2019
ER -