TY - GEN
T1 - Ranking anomalies in data centers
AU - Viswanathan, Krishnamurthy
AU - Choudur, Lakshminarayan
AU - Talwar, Vanish
AU - Wang, Chengwei
AU - Macdonald, Greg
AU - Satterfield, Wade
PY - 2012
Y1 - 2012
N2 - Data centers are growing in size and complexity, driven by trends such as cloud computing and on-line services. Such large data centers pose several challenges for system management. Key among them is anomaly detection, which is required to monitor and analyze metrics across several thousand servers and across multiple layers of abstraction to detect anomalous system behavior. In practice, multiple anomaly detection tools are used to continuously raise alarms across multiple metrics and servers. These alarms include both true positives and false alarms. Administrators and management tools act on these alarms for diagnosis and deeper root cause analysis and take appropriate management actions to mitigate the anomalous behaviors. Given the scale and scope of the system, the administrators and management tools are overwhelmed with the large number of alarms at any given instant, many of which are false alarms. It is therefore necessary to prioritize and rank these alarms, so as to take timely actions that maintain the service level agreements for the data center. Existing techniques for such ranking are ad hoc and not scalable. We propose ranking windows of monitored metrics based on their probability of occurrence. We explain how these probabilities can be computed based either on the false positive rates for which the accompanying anomaly detectors were designed, or, when available, on the probability models underlying the false positive rates. In the simplest case, the ranking procedure reduces to computing the Z-score of the observed measurements and computing a statistic from a window of Z-scores to use as a basis for ranking. The proposed techniques are reliable, lightweight, and easy to deploy in the modern data center. We have validated these techniques on synthetic data containing injected anomalies and on data acquired from production data centers.
AB - Data centers are growing in size and complexity, driven by trends such as cloud computing and on-line services. Such large data centers pose several challenges for system management. Key among them is anomaly detection, which is required to monitor and analyze metrics across several thousand servers and across multiple layers of abstraction to detect anomalous system behavior. In practice, multiple anomaly detection tools are used to continuously raise alarms across multiple metrics and servers. These alarms include both true positives and false alarms. Administrators and management tools act on these alarms for diagnosis and deeper root cause analysis and take appropriate management actions to mitigate the anomalous behaviors. Given the scale and scope of the system, the administrators and management tools are overwhelmed with the large number of alarms at any given instant, many of which are false alarms. It is therefore necessary to prioritize and rank these alarms, so as to take timely actions that maintain the service level agreements for the data center. Existing techniques for such ranking are ad hoc and not scalable. We propose ranking windows of monitored metrics based on their probability of occurrence. We explain how these probabilities can be computed based either on the false positive rates for which the accompanying anomaly detectors were designed, or, when available, on the probability models underlying the false positive rates. In the simplest case, the ranking procedure reduces to computing the Z-score of the observed measurements and computing a statistic from a window of Z-scores to use as a basis for ranking. The proposed techniques are reliable, lightweight, and easy to deploy in the modern data center. We have validated these techniques on synthetic data containing injected anomalies and on data acquired from production data centers.
UR - http://www.scopus.com/inward/record.url?scp=84864194398&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84864194398&partnerID=8YFLogxK
U2 - 10.1109/NOMS.2012.6211885
DO - 10.1109/NOMS.2012.6211885
M3 - Conference contribution
AN - SCOPUS:84864194398
SN - 9781467302685
T3 - Proceedings of the 2012 IEEE Network Operations and Management Symposium, NOMS 2012
SP - 79
EP - 87
BT - Proceedings of the 2012 IEEE Network Operations and Management Symposium, NOMS 2012
T2 - 2012 IEEE Network Operations and Management Symposium, NOMS 2012
Y2 - 16 April 2012 through 20 April 2012
ER -