Optimizing Shuffle in Wide-Area Data Analytics

Shuhao Liu, Hao Wang, Baochun Li

Research output: Chapter in Book/Report/Conference proceedingConference contributionpeer-review

15 Scopus citations

Abstract

As increasingly large volumes of raw data are generated at geographically distributed datacenters, they need to be efficiently processed by data analytic jobs spanning multiple datacenters across wide-area networks. Designed for a single datacenter, existing data processing frameworks, such as Apache Spark, are not able to deliver satisfactory performance when these wide-area analytic jobs are executed. As wide-area networks interconnecting datacenters may not be congestion free, there is a compelling need for a new system framework that is optimized for wide-area data analytics. In this paper, we design and implement a new proactive data aggregation framework based on Apache Spark, with a focus on optimizing the network traffic incurred in shuffle stages of data analytic jobs. The objective of this framework is to strategically and proactively aggregate the output data of mapper tasks to a subset of worker datacenters, as a replacement to Spark's original passive fetch mechanism across datacenters. It improves the performance of wide-area analytic jobs by avoiding repetitive data transfers, which improves the utilization of inter-datacenter links. Our extensive experimental results using standard benchmarks across six Amazon EC2 regions have shown that our proposed framework is able to reduce job completion times by up to 73%, as compared to the existing baseline implementation in Spark.

Original languageEnglish
Title of host publicationProceedings - IEEE 37th International Conference on Distributed Computing Systems, ICDCS 2017
EditorsKisung Lee, Ling Liu
Pages560-571
Number of pages12
ISBN (Electronic)9781538617915
DOIs
StatePublished - 13 Jul 2017
Event37th IEEE International Conference on Distributed Computing Systems, ICDCS 2017 - Atlanta, United States
Duration: 5 Jun 20178 Jun 2017

Publication series

NameProceedings - International Conference on Distributed Computing Systems

Conference

Conference37th IEEE International Conference on Distributed Computing Systems, ICDCS 2017
Country/TerritoryUnited States
CityAtlanta
Period5/06/178/06/17

Keywords

  • Inter-datacenter network
  • MapReduce
  • Shuffle

Fingerprint

Dive into the research topics of 'Optimizing Shuffle in Wide-Area Data Analytics'. Together they form a unique fingerprint.

Cite this