TY - GEN
T1 - Interleaved Bitstream Execution for Multi-Pattern Regex Matching on GPUs
AU - Ge, Tianao
AU - Chu, Xiaowen
AU - Liu, Hongyuan
N1 - Publisher Copyright:
© 2025 Copyright held by the owner/author(s).
PY - 2025/10/17
Y1 - 2025/10/17
N2 - Pattern matching is a key operation in unstructured data analytics, commonly supported by regular expression (regex) engines. Bit-parallel regex engines compile regexes into bitstream programs, which expose fine-grained parallelism and are well-suited for GPU execution. A straightforward strategy executes each bitstream instruction sequentially, processing all data blocks in a loop. However, this execution suffers from poor data reuse and high memory consumption, limiting throughput. Our key insight is to adopt an interleaved execution model, where all bitstream instructions are fused into a single loop and executed block-wise. While interleaved execution could improve data reuse, enabling it on GPUs is non-trivial due to cross-block data dependencies. To address this, we introduce 1) Dependency-Aware Thread-Data Mapping, which resolves cross-block dependencies via selective recomputation. We further improve interleaved execution performance with two additional optimizations: 2) Shift Rebalancing, which balances dependency chains to reduce synchronization barriers; and 3) Zero Block Skipping, which exploits bitstream sparsity to skip computation on zero blocks. Together, these techniques make interleaved execution practical and efficient. Experiments on real-world regex benchmarks demonstrate a 19.5 × geometric mean speedup over the state-of-the-art GPU regex engine.
AB - Pattern matching is a key operation in unstructured data analytics, commonly supported by regular expression (regex) engines. Bit-parallel regex engines compile regexes into bitstream programs, which expose fine-grained parallelism and are well-suited for GPU execution. A straightforward strategy executes each bitstream instruction sequentially, processing all data blocks in a loop. However, this execution suffers from poor data reuse and high memory consumption, limiting throughput. Our key insight is to adopt an interleaved execution model, where all bitstream instructions are fused into a single loop and executed block-wise. While interleaved execution could improve data reuse, enabling it on GPUs is non-trivial due to cross-block data dependencies. To address this, we introduce 1) Dependency-Aware Thread-Data Mapping, which resolves cross-block dependencies via selective recomputation. We further improve interleaved execution performance with two additional optimizations: 2) Shift Rebalancing, which balances dependency chains to reduce synchronization barriers; and 3) Zero Block Skipping, which exploits bitstream sparsity to skip computation on zero blocks. Together, these techniques make interleaved execution practical and efficient. Experiments on real-world regex benchmarks demonstrate a 19.5 × geometric mean speedup over the state-of-the-art GPU regex engine.
KW - Bit Parallelism
KW - Bitstream
KW - Compiler
KW - GPU
KW - Regex Matching
UR - https://www.scopus.com/pages/publications/105021327596
UR - https://www.scopus.com/pages/publications/105021327596#tab=citedBy
U2 - 10.1145/3725843.3756052
DO - 10.1145/3725843.3756052
M3 - Conference contribution
AN - SCOPUS:105021327596
T3 - Proceedings of the Annual International Symposium on Microarchitecture, MICRO
SP - 385
EP - 400
BT - MICRO 2025 - 58th IEEE/ACM International Symposium on Microarchitecture
T2 - 58th IEEE/ACM International Symposium on Microarchitecture , MICRO 2025
Y2 - 18 October 2025 through 22 October 2025
ER -