TY - GEN
T1 - Dynamic and decentralized global analytics via machine learning
AU - Wang, Hao
AU - Niu, Di
AU - Li, Baochun
N1 - Publisher Copyright:
© 2018 Association for Computing Machinery.
PY - 2018/10/11
Y1 - 2018/10/11
N2 - Operating at a large scale, data analytics has become an essential tool for gaining insights from operational data, such as user online activities. With the volume of data growing exponentially, data analytic jobs have expanded from a single datacenter to multiple geographically distributed datacenters. Unfortunately, designed originally for a single datacenter, the software stack that supports data analytics is oblivious to on-the-fly resource variations on inter-datacenter networks, which negatively affects the performance of analytic queries. Existing solutions that optimize query execution plans before their execution are not able to quickly adapt to resource variations at query runtime. In this paper, we present Turbo, a lightweight and non-intrusive data-driven system that dynamically adjusts query execution plans for geo-distributed analytics in response to runtime resource variations across datacenters. A highlight of Turbo is its ability to use machine learning at runtime to accurately estimate the time cost of query execution plans, so that adjustments can be made when necessary. Turbo is non-intrusive in the sense that it does not require modifications to the existing software stack for data analytics. We have implemented a real-world prototype of Turbo, and evaluated it on a cluster of 33 instances across 8 regions in the Google Cloud platform. Our experimental results have shown that Turbo can achieve a cost estimation accuracy of over 95% and reduce query completion times by 41%.
AB - Operating at a large scale, data analytics has become an essential tool for gaining insights from operational data, such as user online activities. With the volume of data growing exponentially, data analytic jobs have expanded from a single datacenter to multiple geographically distributed datacenters. Unfortunately, designed originally for a single datacenter, the software stack that supports data analytics is oblivious to on-the-fly resource variations on inter-datacenter networks, which negatively affects the performance of analytic queries. Existing solutions that optimize query execution plans before their execution are not able to quickly adapt to resource variations at query runtime. In this paper, we present Turbo, a lightweight and non-intrusive data-driven system that dynamically adjusts query execution plans for geo-distributed analytics in response to runtime resource variations across datacenters. A highlight of Turbo is its ability to use machine learning at runtime to accurately estimate the time cost of query execution plans, so that adjustments can be made when necessary. Turbo is non-intrusive in the sense that it does not require modifications to the existing software stack for data analytics. We have implemented a real-world prototype of Turbo, and evaluated it on a cluster of 33 instances across 8 regions in the Google Cloud platform. Our experimental results have shown that Turbo can achieve a cost estimation accuracy of over 95% and reduce query completion times by 41%.
KW - Data Analytics
KW - Distributed Systems
KW - Machine Learning
UR - http://www.scopus.com/inward/record.url?scp=85058997244&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85058997244&partnerID=8YFLogxK
U2 - 10.1145/3267809.3267812
DO - 10.1145/3267809.3267812
M3 - Conference contribution
AN - SCOPUS:85058997244
T3 - SoCC 2018 - Proceedings of the 2018 ACM Symposium on Cloud Computing
SP - 14
EP - 25
BT - SoCC 2018 - Proceedings of the 2018 ACM Symposium on Cloud Computing
T2 - 2018 ACM Symposium on Cloud Computing, SoCC 2018
Y2 - 11 October 2018 through 13 October 2018
ER -