TY - GEN
T1 - Why GPUs are slow at executing NFAs and how to make them faster
AU - Liu, Hongyuan
AU - Pai, Sreepathi
AU - Jog, Adwait
N1 - Publisher Copyright:
© 2020 Association for Computing Machinery.
PY - 2020/3/9
Y1 - 2020/3/9
N2 - Non-deterministic Finite Automata (NFA) are space-efficient finite state machines that have significant applications in domains such as pattern matching and data analytics. In this paper, we investigate why the Graphics Processing Unit (GPU), a massively parallel computational device with the highest memory bandwidth available on general-purpose processors, cannot efficiently execute NFAs. First, we identify excessive data movement in the GPU memory hierarchy and describe how to privatize reads effectively using the GPU's on-chip memory hierarchy to reduce this excessive data movement. We also show that, in several cases, indirect table lookups in NFAs can be eliminated by converting memory reads into computation, further reducing the number of memory reads. Although our optimization techniques significantly alleviate these memory-related bottlenecks, a side effect of these techniques is the static assignment of work to cores. This leads to poor compute utilization, in which GPU cores are wasted on idle NFA states. Therefore, we propose a new dynamic scheme that effectively balances compute utilization with reduced memory usage. Our combined optimizations provide a significant improvement over the previous state-of-the-art GPU implementations of NFAs. Moreover, they enable current GPUs to outperform the domain-specific accelerator for NFAs (i.e., the Automata Processor) across several applications while performing within an order of magnitude for the rest of the applications.
AB - Non-deterministic Finite Automata (NFA) are space-efficient finite state machines that have significant applications in domains such as pattern matching and data analytics. In this paper, we investigate why the Graphics Processing Unit (GPU), a massively parallel computational device with the highest memory bandwidth available on general-purpose processors, cannot efficiently execute NFAs. First, we identify excessive data movement in the GPU memory hierarchy and describe how to privatize reads effectively using the GPU's on-chip memory hierarchy to reduce this excessive data movement. We also show that, in several cases, indirect table lookups in NFAs can be eliminated by converting memory reads into computation, further reducing the number of memory reads. Although our optimization techniques significantly alleviate these memory-related bottlenecks, a side effect of these techniques is the static assignment of work to cores. This leads to poor compute utilization, in which GPU cores are wasted on idle NFA states. Therefore, we propose a new dynamic scheme that effectively balances compute utilization with reduced memory usage. Our combined optimizations provide a significant improvement over the previous state-of-the-art GPU implementations of NFAs. Moreover, they enable current GPUs to outperform the domain-specific accelerator for NFAs (i.e., the Automata Processor) across several applications while performing within an order of magnitude for the rest of the applications.
KW - Finite State Machine
KW - GPU
KW - Parallel Computing
UR - http://www.scopus.com/inward/record.url?scp=85082399596&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85082399596&partnerID=8YFLogxK
U2 - 10.1145/3373376.3378471
DO - 10.1145/3373376.3378471
M3 - Conference contribution
AN - SCOPUS:85082399596
T3 - International Conference on Architectural Support for Programming Languages and Operating Systems - ASPLOS
SP - 251
EP - 265
BT - ASPLOS 2020 - 25th International Conference on Architectural Support for Programming Languages and Operating Systems
T2 - 25th International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS 2020
Y2 - 16 March 2020 through 20 March 2020
ER -