TY - GEN
T1 - ngAP
T2 - 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS 2024
AU - Ge, Tianao
AU - Zhang, Tong
AU - Liu, Hongyuan
N1 - Publisher Copyright:
© 2024 Association for Computing Machinery. All rights reserved.
PY - 2024/4/27
Y1 - 2024/4/27
N2 - Finite automata serve as compute kernels for various applications that require high throughput. However, despite the increasing compute power of GPUs, their potential in processing automata remains underutilized. In this work, we identify three major challenges that limit GPU throughput. 1) The available parallelism is insufficient, resulting in underutilized GPU threads. 2) Automata workloads involve significant redundant computations since a portion of states matches with repeated symbols. 3) The mapping between threads and states is switched dynamically, leading to poor data locality. Our key insight is that processing automata "one-symbol-at-a-time" serializes the execution, and thus needs to be revamped. To address these challenges, we propose Non-blocking Automata Processing, which allows parallel processing of different symbols in the input stream and also enables further optimizations: 1) We prefetch a portion of computations to increase the chances of processing multiple symbols simultaneously, thereby utilizing GPU threads better. 2) To reduce redundant computations, we store repeated computations in a memoization table, enabling us to substitute them with table lookups. 3) We privatize some computations to preserve the mapping between threads and states, thus improving data locality. Experimental results demonstrate that our approach outperforms the state-of-the-art GPU automata processing engine by an average of 7.9× and up to 901× across 20 applications.
AB - Finite automata serve as compute kernels for various applications that require high throughput. However, despite the increasing compute power of GPUs, their potential in processing automata remains underutilized. In this work, we identify three major challenges that limit GPU throughput. 1) The available parallelism is insufficient, resulting in underutilized GPU threads. 2) Automata workloads involve significant redundant computations since a portion of states matches with repeated symbols. 3) The mapping between threads and states is switched dynamically, leading to poor data locality. Our key insight is that processing automata "one-symbol-at-a-time" serializes the execution, and thus needs to be revamped. To address these challenges, we propose Non-blocking Automata Processing, which allows parallel processing of different symbols in the input stream and also enables further optimizations: 1) We prefetch a portion of computations to increase the chances of processing multiple symbols simultaneously, thereby utilizing GPU threads better. 2) To reduce redundant computations, we store repeated computations in a memoization table, enabling us to substitute them with table lookups. 3) We privatize some computations to preserve the mapping between threads and states, thus improving data locality. Experimental results demonstrate that our approach outperforms the state-of-the-art GPU automata processing engine by an average of 7.9× and up to 901× across 20 applications.
KW - finite state machine
KW - GPU
KW - parallel computing
UR - http://www.scopus.com/inward/record.url?scp=85191488732&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85191488732&partnerID=8YFLogxK
U2 - 10.1145/3617232.3624848
DO - 10.1145/3617232.3624848
M3 - Conference contribution
AN - SCOPUS:85191488732
T3 - International Conference on Architectural Support for Programming Languages and Operating Systems - ASPLOS
SP - 268
EP - 285
BT - Spring Cycle
Y2 - 27 April 2024 through 1 May 2024
ER -