Automatically and Adaptively Identifying Severe Alerts for Online Service Systems

Nengwen Zhao, Panshi Jin, Lixin Wang, Xiaoqin Yang, Rong Liu, Wenchi Zhang, Kaixin Sui, Dan Pei

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution > peer-review

29 Scopus citations

Abstract

In large-scale online service systems, to enhance the quality of services, engineers need to collect various monitoring data and write many rules to trigger alerts. However, the number of alerts far exceeds what on-call engineers can properly investigate. Thus, in practice, alerts are classified into several priority levels using manual rules, and on-call engineers primarily focus on handling the alerts with the highest priority level (i.e., severe alerts). Unfortunately, due to the complex and dynamic nature of online services, this rule-based approach results in missed severe alerts or wasted troubleshooting time on non-severe alerts. In this paper, we propose AlertRank, an automatic and adaptive framework for identifying severe alerts. Specifically, AlertRank extracts a set of powerful and interpretable features (textual and temporal alert features, univariate and multivariate anomaly features for monitoring metrics), adopts the XGBoost ranking algorithm to identify the severe alerts out of all incoming alerts, and uses novel methods to obtain labels for both training and testing. Experiments on datasets from a top global commercial bank demonstrate that AlertRank is effective and achieves an F1-score of 0.89 on average, outperforming all baselines. Feedback from practice shows that AlertRank can significantly reduce manual effort for on-call engineers.
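The final step described in the abstract (picking severe alerts out of a ranked stream and scoring the result with F1) can be illustrated with a small sketch. This is not AlertRank itself: the `Alert` class, the `score` values, and the 0.5 threshold are hypothetical stand-ins for whatever a trained ranking model would produce, and the sample data is invented purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    alert_id: str
    score: float     # hypothetical ranking score; higher means more severe
    is_severe: bool  # ground-truth label, assumed to be available for evaluation


def identify_severe(alerts, threshold):
    """Rank alerts by score (descending) and keep those at or above the threshold."""
    ranked = sorted(alerts, key=lambda a: a.score, reverse=True)
    return [a for a in ranked if a.score >= threshold]


def f1_score(alerts, predicted):
    """Compute F1 of the predicted-severe set against the ground-truth labels."""
    predicted_ids = {a.alert_id for a in predicted}
    tp = sum(1 for a in alerts if a.is_severe and a.alert_id in predicted_ids)
    fp = len(predicted_ids) - tp
    fn = sum(1 for a in alerts if a.is_severe and a.alert_id not in predicted_ids)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Invented sample data: scores are placeholders, not model output.
alerts = [
    Alert("a1", 0.92, True),
    Alert("a2", 0.15, False),
    Alert("a3", 0.71, True),
    Alert("a4", 0.64, False),
    Alert("a5", 0.38, True),
]

severe = identify_severe(alerts, threshold=0.5)
print([a.alert_id for a in severe])
print(round(f1_score(alerts, severe), 3))
```

In this toy data, one non-severe alert is flagged and one severe alert is missed, which corresponds to the two failure modes the abstract attributes to rule-based prioritization: wasted troubleshooting time and missed severe alerts.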

Original language: English
Title of host publication: INFOCOM 2020 - IEEE Conference on Computer Communications
Pages: 2420-2429
Number of pages: 10
ISBN (Electronic): 9781728164120
State: Published - Jul 2020
Event: 38th IEEE Conference on Computer Communications, INFOCOM 2020 - Toronto, Canada
Duration: 6 Jul 2020 - 9 Jul 2020

Publication series

Name: Proceedings - IEEE INFOCOM
Volume: 2020-July
ISSN (Print): 0743-166X

Conference

Conference: 38th IEEE Conference on Computer Communications, INFOCOM 2020
Country/Territory: Canada
City: Toronto
Period: 6/07/20 - 9/07/20
