TY - JOUR
T1 - Turbo
T2 - Dynamic and Decentralized Global Analytics via Machine Learning
AU - Wang, Hao
AU - Niu, Di
AU - Li, Baochun
N1 - Publisher Copyright:
© 1990-2012 IEEE.
PY - 2020/6/1
Y1 - 2020/6/1
N2 - Big data analytics are practiced in many fields to extract insights from massive amounts of data. With exponential growth in both the volume and variety of data, analytic queries have expanded from those executed in a single datacenter to those requiring inputs from multiple datacenters that are geographically separate or even globally distributed. Unfortunately, the software stack that supports data analytics is designed originally for a cluster environment and is not tailored to execute global analytic queries, where resources such as inter-datacenter networks may vary on the fly. Existing optimization strategies that determine the query execution plan before its execution are not able to adapt to resource variations at query runtime. In this article, we present Turbo, a lightweight and non-intrusive global analytics system that can dynamically adjust query execution plans for geo-distributed analytics in the presence of time-varying resources and network bandwidth across datacenters. Turbo uses machine learning to accurately predict the time cost of a query execution plan so that dynamic adjustments can be made to it when necessary. Turbo is non-intrusive in the sense that it does not require modifications to the existing software stack for data analytics. We have implemented a real-world prototype of Turbo, and evaluated it on a cluster of 33 instances across eight regions in the Google Cloud Platform. Our experimental results have shown that Turbo can achieve an accuracy of 95 percent for estimating time costs, and can reduce the query completion time by 41 percent.
AB - Big data analytics are practiced in many fields to extract insights from massive amounts of data. With exponential growth in both the volume and variety of data, analytic queries have expanded from those executed in a single datacenter to those requiring inputs from multiple datacenters that are geographically separate or even globally distributed. Unfortunately, the software stack that supports data analytics is designed originally for a cluster environment and is not tailored to execute global analytic queries, where resources such as inter-datacenter networks may vary on the fly. Existing optimization strategies that determine the query execution plan before its execution are not able to adapt to resource variations at query runtime. In this article, we present Turbo, a lightweight and non-intrusive global analytics system that can dynamically adjust query execution plans for geo-distributed analytics in the presence of time-varying resources and network bandwidth across datacenters. Turbo uses machine learning to accurately predict the time cost of a query execution plan so that dynamic adjustments can be made to it when necessary. Turbo is non-intrusive in the sense that it does not require modifications to the existing software stack for data analytics. We have implemented a real-world prototype of Turbo, and evaluated it on a cluster of 33 instances across eight regions in the Google Cloud Platform. Our experimental results have shown that Turbo can achieve an accuracy of 95 percent for estimating time costs, and can reduce the query completion time by 41 percent.
KW - Data analytics
KW - distributed systems
KW - machine learning
UR - http://www.scopus.com/inward/record.url?scp=85078944130&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85078944130&partnerID=8YFLogxK
U2 - 10.1109/TPDS.2020.2964667
DO - 10.1109/TPDS.2020.2964667
M3 - Article
AN - SCOPUS:85078944130
SN - 1045-9219
VL - 31
SP - 1372
EP - 1386
JO - IEEE Transactions on Parallel and Distributed Systems
JF - IEEE Transactions on Parallel and Distributed Systems
IS - 6
M1 - 8951093
ER -