Distributed Machine Learning with a Serverless Architecture

Hao Wang, Di Niu, Baochun Li

Research output: Chapter in Book/Report/Conference proceedingConference contributionpeer-review

113 Scopus citations

Abstract

The need to scale up machine learning, in the presence of a rapid growth of data both in volume and in variety, has sparked broad interests to develop distributed machine learning systems, typically based on parameter servers. However, since these systems are based on a dedicated cluster of physical or virtual machines, they have posed non-trivial cluster management overhead to machine learning practitioners and data scientists. In addition, there exists an inherent mismatch between the dynamically varying resource demands during a model training job and the inflexible resource provisioning model of current cluster-based systems.In this paper, we propose SIREN, an asynchronous distributed machine learning framework based on the emerging serverless architecture, with which stateless functions can be executed in the cloud without the complexity of building and maintaining virtual machine infrastructures. With SIREN, we are able to achieve a higher level of parallelism and elasticity by using a swarm of stateless functions, each working on a different batch of data, while greatly reducing system configuration overhead. Furthermore, we propose a scheduler based on Deep Reinforcement Learning to dynamically control the number and memory size of the stateless functions that should be used in each training epoch. The scheduler learns from the training process itself, in pursuit for the minimum possible training time given a cost. With our real-world prototype implementation on AWS Lambda, extensive experimental results have shown that SIREN can reduce model training time by up to 44%, as compared to traditional machine learning training benchmarks on AWS EC2 at the same cost.

Original languageEnglish
Title of host publicationINFOCOM 2019 - IEEE Conference on Computer Communications
Pages1288-1296
Number of pages9
ISBN (Electronic)9781728105154
DOIs
StatePublished - Apr 2019
Event2019 IEEE Conference on Computer Communications, INFOCOM 2019 - Paris, France
Duration: 29 Apr 20192 May 2019

Publication series

NameProceedings - IEEE INFOCOM
Volume2019-April
ISSN (Print)0743-166X

Conference

Conference2019 IEEE Conference on Computer Communications, INFOCOM 2019
Country/TerritoryFrance
CityParis
Period29/04/192/05/19

Fingerprint

Dive into the research topics of 'Distributed Machine Learning with a Serverless Architecture'. Together they form a unique fingerprint.

Cite this