TY - JOUR
T1 - Generating realistic synthetic population datasets
AU - Wu, Hao
AU - Ning, Yue
AU - Chakraborty, Prithwish
AU - Vreeken, Jilles
AU - Tatti, Nikolaj
AU - Ramakrishnan, Naren
N1 - Publisher Copyright: © 2018 ACM.
PY - 2018/7
Y1 - 2018/7
AB - Modern studies of societal phenomena rely on the availability of large datasets capturing attributes and activities of synthetic, city-level populations. For instance, in epidemiology, synthetic population datasets are necessary to study disease propagation and intervention measures before implementation. In social science, synthetic population datasets are needed to understand how policy decisions might affect preferences and behaviors of individuals. In public health, synthetic population datasets are necessary to capture diagnostic and procedural characteristics of patient records without violating the confidentiality of individuals. To generate such datasets over a large set of categorical variables, we propose the use of the maximum entropy principle to formalize a generative model such that in a statistically well-founded way we can optimally utilize given prior information about the data, and are unbiased otherwise. An efficient inference algorithm is designed to estimate the maximum entropy model, and we demonstrate how our approach is adept at estimating underlying data distributions. We evaluate this approach against both simulated data and US census datasets, and demonstrate its feasibility using an epidemic simulation application.
KW - Maximum entropy models
KW - Multivariate categorical data
KW - Probabilistic modeling
KW - Synthetic population
UR - http://www.scopus.com/inward/record.url?scp=85052552643&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85052552643&partnerID=8YFLogxK
U2 - 10.1145/3182383
DO - 10.1145/3182383
M3 - Article
AN - SCOPUS:85052552643
SN - 1556-4681
VL - 12
JO - ACM Transactions on Knowledge Discovery from Data
JF - ACM Transactions on Knowledge Discovery from Data
IS - 4
M1 - a45
ER -