TY - JOUR
T1 - Cost-efficient data acquisition on online data marketplaces for correlation analysis
AU - Li, Yanying
AU - Sun, Haipei
AU - Dong, Boxiang
AU - Wang, Hui
N1 - Publisher Copyright:
© 2018, VLDB Endowment.
PY - 2018
Y1 - 2018
N2 - Incentivized by the enormous economic profits, the data marketplace platform has been proliferated recently. In this paper, we consider the data marketplace setting where a data shopper would like to buy data instances from the data marketplace for correlation analysis of certain attributes. We assume that the data in the marketplace is dirty and not free. The goal is to find the data instances from a large number of datasets in the marketplace whose join result not only is of high-quality and rich join informativeness, but also delivers the best correlation between the requested attributes. To achieve this goal, we design DANCE, a middleware that provides the desired data acquisition service. DANCE consists of two phases: (1) In the off-line phase, it constructs a two-layer join graph from samples. The join graph includes the information of the datasets in the marketplace at both schema and instance levels; (2) In the on- line phase, it searches for the data instances that satisfy the constraints of data quality, budget, and join informativeness, while maximizing the correlation of source and target attribute sets. We prove that the complexity of the search problem is NP-hard, and design a heuristic algorithm based on Markov chain Monte Carlo (MCMC). Experiment results on two benchmark and one real datasets demonstrate the efficiency and effectiveness of our heuristic data acquisition algorithm.
AB - Incentivized by the enormous economic profits, the data marketplace platform has been proliferated recently. In this paper, we consider the data marketplace setting where a data shopper would like to buy data instances from the data marketplace for correlation analysis of certain attributes. We assume that the data in the marketplace is dirty and not free. The goal is to find the data instances from a large number of datasets in the marketplace whose join result not only is of high-quality and rich join informativeness, but also delivers the best correlation between the requested attributes. To achieve this goal, we design DANCE, a middleware that provides the desired data acquisition service. DANCE consists of two phases: (1) In the off-line phase, it constructs a two-layer join graph from samples. The join graph includes the information of the datasets in the marketplace at both schema and instance levels; (2) In the on- line phase, it searches for the data instances that satisfy the constraints of data quality, budget, and join informativeness, while maximizing the correlation of source and target attribute sets. We prove that the complexity of the search problem is NP-hard, and design a heuristic algorithm based on Markov chain Monte Carlo (MCMC). Experiment results on two benchmark and one real datasets demonstrate the efficiency and effectiveness of our heuristic data acquisition algorithm.
UR - http://www.scopus.com/inward/record.url?scp=85077230592&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85077230592&partnerID=8YFLogxK
U2 - 10.14778/3297753.3297757
DO - 10.14778/3297753.3297757
M3 - Conference article
AN - SCOPUS:85077230592
VL - 12
SP - 362
EP - 375
JO - Proceedings of the VLDB Endowment
JF - Proceedings of the VLDB Endowment
IS - 4
T2 - 45th International Conference on Very Large Data Bases, VLDB 2019
Y2 - 26 August 2017 through 30 August 2017
ER -