TY - GEN
T1 - Optimizing Shuffle in Wide-Area Data Analytics
AU - Liu, Shuhao
AU - Wang, Hao
AU - Li, Baochun
N1 - Publisher Copyright:
© 2017 IEEE.
PY - 2017/7/13
Y1 - 2017/7/13
N2 - As increasingly large volumes of raw data are generated at geographically distributed datacenters, they need to be efficiently processed by data analytic jobs spanning multiple datacenters across wide-area networks. Designed for a single datacenter, existing data processing frameworks, such as Apache Spark, are not able to deliver satisfactory performance when these wide-area analytic jobs are executed. As wide-area networks interconnecting datacenters may not be congestion free, there is a compelling need for a new system framework that is optimized for wide-area data analytics. In this paper, we design and implement a new proactive data aggregation framework based on Apache Spark, with a focus on optimizing the network traffic incurred in shuffle stages of data analytic jobs. The objective of this framework is to strategically and proactively aggregate the output data of mapper tasks to a subset of worker datacenters, as a replacement to Spark's original passive fetch mechanism across datacenters. It improves the performance of wide-area analytic jobs by avoiding repetitive data transfers, which improves the utilization of inter-datacenter links. Our extensive experimental results using standard benchmarks across six Amazon EC2 regions have shown that our proposed framework is able to reduce job completion times by up to 73%, as compared to the existing baseline implementation in Spark.
AB - As increasingly large volumes of raw data are generated at geographically distributed datacenters, they need to be efficiently processed by data analytic jobs spanning multiple datacenters across wide-area networks. Designed for a single datacenter, existing data processing frameworks, such as Apache Spark, are not able to deliver satisfactory performance when these wide-area analytic jobs are executed. As wide-area networks interconnecting datacenters may not be congestion free, there is a compelling need for a new system framework that is optimized for wide-area data analytics. In this paper, we design and implement a new proactive data aggregation framework based on Apache Spark, with a focus on optimizing the network traffic incurred in shuffle stages of data analytic jobs. The objective of this framework is to strategically and proactively aggregate the output data of mapper tasks to a subset of worker datacenters, as a replacement to Spark's original passive fetch mechanism across datacenters. It improves the performance of wide-area analytic jobs by avoiding repetitive data transfers, which improves the utilization of inter-datacenter links. Our extensive experimental results using standard benchmarks across six Amazon EC2 regions have shown that our proposed framework is able to reduce job completion times by up to 73%, as compared to the existing baseline implementation in Spark.
KW - Inter-datacenter network
KW - MapReduce
KW - Shuffle
UR - http://www.scopus.com/inward/record.url?scp=85027281279&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85027281279&partnerID=8YFLogxK
U2 - 10.1109/ICDCS.2017.131
DO - 10.1109/ICDCS.2017.131
M3 - Conference contribution
AN - SCOPUS:85027281279
T3 - Proceedings - International Conference on Distributed Computing Systems
SP - 560
EP - 571
BT - Proceedings - IEEE 37th International Conference on Distributed Computing Systems, ICDCS 2017
A2 - Lee, Kisung
A2 - Liu, Ling
T2 - 37th IEEE International Conference on Distributed Computing Systems, ICDCS 2017
Y2 - 5 June 2017 through 8 June 2017
ER -