TY - GEN
T1 - MaxK-GNN: Extremely Fast GPU Kernel Design for Accelerating Graph Neural Networks Training
T2 - 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS 2024
AU - Peng, Hongwu
AU - Xie, Xi
AU - Shivdikar, Kaustubh
AU - Hasan, Md Amit
AU - Zhao, Jiahui
AU - Huang, Shaoyi
AU - Khan, Omer
AU - Kaeli, David
AU - Ding, Caiwen
N1 - Publisher Copyright:
© 2024 Copyright is held by the owner/author(s). Publication rights licensed to ACM.
PY - 2024/4/27
Y1 - 2024/4/27
N2 - In the acceleration of deep neural network training, the graphics processing unit (GPU) has become the mainstream platform. GPUs face substantial challenges on Graph Neural Networks (GNNs), such as workload imbalance and memory access irregularities, leading to underutilized hardware. Existing solutions such as the PyG, DGL with cuSPARSE, and GNNAdvisor frameworks partially address these challenges, but the memory traffic involved in Sparse-Dense Matrix-Matrix Multiplication (SpMM) remains significant. We argue that drastic performance improvements can only be achieved by vertical optimization across algorithm and system innovations, rather than treating speedup optimization as an "afterthought" (i.e., (i) given a GNN algorithm, designing an accelerator, or (ii) given hardware, mainly optimizing the GNN algorithm). In this paper, we present MaxK-GNN, an advanced high-performance GPU training system integrating algorithm and system innovation. (i) We introduce the MaxK nonlinearity, provide a theoretical analysis of MaxK nonlinearity as a universal approximator, and present the Compressed Balanced Sparse Row (CBSR) format, designed to store the data and indices of the feature matrix after the nonlinearity; (ii) We design a coalescing-enhanced forward computation with a row-wise product-based Sparse Matrix-Matrix Multiplication (SpGEMM) kernel that uses CBSR for input feature matrix fetching and strategically places a sparse output accumulation buffer in shared memory; (iii) We develop an optimized backward computation with an outer product-based Sampled Sparse Matrix Dense Matrix Multiplication (SSpMM) kernel. We conduct extensive evaluations of MaxK-GNN and report the system training time. Experiments show that the MaxK-GNN system approaches the speedup limit given by Amdahl's law. We achieve comparable accuracy to SOTA GNNs, but at significantly higher speed: a 3.22×/4.24× speedup (vs. theoretical limits of 5.52×/7.27×) on Reddit compared to the DGL and GNNAdvisor implementations.
KW - GPUs
KW - graph neural network
KW - MaxK nonlinearity
KW - parallel computing
KW - sampled sparse matrix dense matrix multiplication (SSpMM)
KW - SpGEMM
KW - SpMM
UR - http://www.scopus.com/inward/record.url?scp=85192166640&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85192166640&partnerID=8YFLogxK
U2 - 10.1145/3620665.3640426
DO - 10.1145/3620665.3640426
M3 - Conference contribution
AN - SCOPUS:85192166640
T3 - International Conference on Architectural Support for Programming Languages and Operating Systems - ASPLOS
SP - 683
EP - 698
BT - Summer Cycle
Y2 - 27 April 2024 through 1 May 2024
ER -