TY - GEN
T1 - Automatically and Adaptively Identifying Severe Alerts for Online Service Systems
AU - Zhao, Nengwen
AU - Jin, Panshi
AU - Wang, Lixin
AU - Yang, Xiaoqin
AU - Liu, Rong
AU - Zhang, Wenchi
AU - Sui, Kaixin
AU - Pei, Dan
N1 - Publisher Copyright: © 2020 IEEE.
PY - 2020/7
Y1 - 2020/7
AB - In large-scale online service systems, to enhance the quality of services, engineers need to collect various monitoring data and write many rules to trigger alerts. However, the number of alerts far exceeds what on-call engineers can properly investigate. Thus, in practice, alerts are classified into several priority levels using manual rules, and on-call engineers primarily focus on handling the alerts with the highest priority level (i.e., severe alerts). Unfortunately, due to the complex and dynamic nature of online services, this rule-based approach results in missed severe alerts or wasted troubleshooting time on non-severe alerts. In this paper, we propose AlertRank, an automatic and adaptive framework for identifying severe alerts. Specifically, AlertRank extracts a set of powerful and interpretable features (textual and temporal alert features, univariate and multivariate anomaly features for monitoring metrics), adopts the XGBoost ranking algorithm to identify the severe alerts out of all incoming alerts, and uses novel methods to obtain labels for both training and testing. Experiments on datasets from a top global commercial bank demonstrate that AlertRank is effective and achieves an average F1-score of 0.89, outperforming all baselines. Feedback from practice shows that AlertRank can significantly reduce the manual effort of on-call engineers.
UR - http://www.scopus.com/inward/record.url?scp=85090282721&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85090282721&partnerID=8YFLogxK
U2 - 10.1109/INFOCOM41043.2020.9155219
DO - 10.1109/INFOCOM41043.2020.9155219
M3 - Conference contribution
AN - SCOPUS:85090282721
T3 - Proceedings - IEEE INFOCOM
SP - 2420
EP - 2429
BT - INFOCOM 2020 - IEEE Conference on Computer Communications
T2 - 38th IEEE Conference on Computer Communications, INFOCOM 2020
Y2 - 6 July 2020 through 9 July 2020
ER -