TY - GEN
T1 - Tiered sampling
T2 - 5th IEEE International Conference on Big Data, Big Data 2017
AU - de Stefani, Lorenzo
AU - Terolli, Erisa
AU - Upfal, Eli
N1 - Publisher Copyright:
© 2017 IEEE.
PY - 2017/7/1
Y1 - 2017/7/1
N2 - —We introduce Tiered Sampling, a novel technique for approximate counting sparse motifs in massive graphs whose edges are observed in a stream. Our technique requires only a single pass on the data and uses a memory of fixed size M, which can be magnitudes smaller than the number of edges. Our methods addresses the challenging task of counting sparse motifs - sub-graph patterns that have low probability to appear in a sample of M edges in the graph, which is the maximum amount of data available to the algorithms in each step. To obtain an unbiased and low variance estimate of the count we partition the available memory to tiers (layers) of reservoir samples. While the base layer is a standard reservoir sample of edges, other layers are reservoir samples of sub-structures of the desired motif. By storing more frequent sub-structures of the motif, we increase the probability of detecting an occurrence of the sparse motif we are counting, thus decreasing the variance and error of the estimate. We demonstrate the advantage of our method in the specific applications of counting sparse 4 and 5-cliques in massive graphs. We present a complete analytical analysis and extensive experimental results using both synthetic and real-world data. Our results demonstrate the advantage of our method in obtaining high-quality approximations for the number of 4 and 5-cliques for large graphs using a very limited amount of memory, significantly outperforming the single edge sample approach for counting sparse motifs in large scale graphs.
AB - —We introduce Tiered Sampling, a novel technique for approximate counting sparse motifs in massive graphs whose edges are observed in a stream. Our technique requires only a single pass on the data and uses a memory of fixed size M, which can be magnitudes smaller than the number of edges. Our methods addresses the challenging task of counting sparse motifs - sub-graph patterns that have low probability to appear in a sample of M edges in the graph, which is the maximum amount of data available to the algorithms in each step. To obtain an unbiased and low variance estimate of the count we partition the available memory to tiers (layers) of reservoir samples. While the base layer is a standard reservoir sample of edges, other layers are reservoir samples of sub-structures of the desired motif. By storing more frequent sub-structures of the motif, we increase the probability of detecting an occurrence of the sparse motif we are counting, thus decreasing the variance and error of the estimate. We demonstrate the advantage of our method in the specific applications of counting sparse 4 and 5-cliques in massive graphs. We present a complete analytical analysis and extensive experimental results using both synthetic and real-world data. Our results demonstrate the advantage of our method in obtaining high-quality approximations for the number of 4 and 5-cliques for large graphs using a very limited amount of memory, significantly outperforming the single edge sample approach for counting sparse motifs in large scale graphs.
KW - Graph motif mining
KW - Reservoir sampling
KW - Stream computing
UR - http://www.scopus.com/inward/record.url?scp=85078537288&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85078537288&partnerID=8YFLogxK
U2 - 10.1109/BigData.2017.8257993
DO - 10.1109/BigData.2017.8257993
M3 - Conference contribution
AN - SCOPUS:85078537288
T3 - Proceedings - 2017 IEEE International Conference on Big Data, Big Data 2017
SP - 776
EP - 786
BT - Proceedings - 2017 IEEE International Conference on Big Data, Big Data 2017
A2 - Nie, Jian-Yun
A2 - Obradovic, Zoran
A2 - Suzumura, Toyotaro
A2 - Ghosh, Rumi
A2 - Nambiar, Raghunath
A2 - Wang, Chonggang
A2 - Zang, Hui
A2 - Baeza-Yates, Ricardo
A2 - Baeza-Yates, Ricardo
A2 - Hu, Xiaohua
A2 - Kepner, Jeremy
A2 - Cuzzocrea, Alfredo
A2 - Tang, Jian
A2 - Toyoda, Masashi
Y2 - 11 December 2017 through 14 December 2017
ER -