Tiered Sampling: An Efficient Method for Counting Sparse Motifs in Massive Graph Streams

Lorenzo De Stefani, Erisa Terolli, Eli Upfal

Research output: Contribution to journalArticlepeer-review

3 Scopus citations

Abstract

We introduce Tiered Sampling, a novel technique for estimating the count of sparse motifs in massive graphs whose edges are observed in a stream. Our technique requires only a single pass on the data and uses a memory of fixed size M, which can be magnitudes smaller than the number of edges. Our methods address the challenging task of counting sparse motifs - sub-graph patterns - that have a low probability of appearing in a sample of M edges in the graph, which is the maximum amount of data available to the algorithms in each step. To obtain an unbiased and low variance estimate of the count, we partition the available memory into tiers (layers) of reservoir samples. While the base layer is a standard reservoir sample of edges, other layers are reservoir samples of sub-structures of the desired motif. By storing more frequent sub-structures of the motif, we increase the probability of detecting an occurrence of the sparse motif we are counting, thus decreasing the variance and error of the estimate. While we focus on the designing and analysis of algorithms for counting 4-cliques, we present a method which allows generalizing Tiered Sampling to obtain high-quality estimates for the number of occurrence of any sub-graph of interest, while reducing the analysis effort due to specific properties of the pattern of interest. We present a complete analytical analysis and extensive experimental evaluation of our proposed method using both synthetic and real-world data. Our results demonstrate the advantage of our method in obtaining high-quality approximations for the number of 4 and 5-cliques for large graphs using a very limited amount of memory, significantly outperforming the single edge sample approach for counting sparse motifs in large scale graphs.

Original languageEnglish
Article number3441299
JournalACM Transactions on Knowledge Discovery from Data
Volume15
Issue number5
DOIs
StatePublished - Jun 2021

Keywords

  • Graph motif mining
  • reservoir sampling
  • stream computing

Fingerprint

Dive into the research topics of 'Tiered Sampling: An Efficient Method for Counting Sparse Motifs in Massive Graph Streams'. Together they form a unique fingerprint.

Cite this