In recent years, various data clustering algorithms have been proposed in the data mining and engineering communities. However, traditional clustering methods still have drawbacks worth further investigation, such as clustering high-dimensional data, learning an ideal affinity matrix that optimally reveals the global data structure, discovering the intrinsic geometrical and discriminative properties of the data space, and reducing the influence of noise introduced by complex data inputs. In this paper, we propose a novel clustering algorithm called robust dual clustering with adaptive manifold regularization (RDC), which simultaneously performs dual matrix factorization tasks targeting an identical cluster indicator in both the original and the projected feature spaces. In RDC, the l2,1-norm is used instead of the conventional l2-norm to measure the loss, which improves the model's robustness by alleviating the influence of noise and outliers. To better capture the intrinsic geometrical and discriminative data structure, we incorporate a manifold regularization term on the cluster indicator using a specially learned affinity matrix that is more suitable for the clustering task. Moreover, a novel augmented Lagrangian method (ALM) based procedure is designed to effectively and efficiently seek the optimal solution of the proposed RDC optimization. Numerous experiments on representative data sets demonstrate the superior performance of the proposed method compared with existing clustering algorithms.
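A minimal sketch of why the l2,1-norm loss is more robust than a squared l2 (Frobenius) loss: an outlier row of the residual matrix contributes linearly rather than quadratically. The matrix below is an illustrative example, not data from the paper.

```python
import math

def l21_norm(M):
    # Sum of the l2 norms of the rows: an outlier row contributes
    # linearly, so it cannot dominate the loss.
    return sum(math.sqrt(sum(x * x for x in row)) for row in M)

def squared_frobenius(M):
    # Sum of squared entries: an outlier row contributes quadratically.
    return sum(x * x for row in M for x in row)

# Residual matrix with one "clean" row and one outlier row.
E = [[3.0, 4.0],    # row norm 5
     [30.0, 40.0]]  # outlier, row norm 50

print(l21_norm(E))           # 5 + 50 = 55.0
print(squared_frobenius(E))  # 25 + 2500 = 2525.0
```

The outlier row is 10x larger than the clean one; it contributes 10x more to the l2,1 loss but 100x more to the squared Frobenius loss, which is why the latter lets outliers dominate the factorization.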
To ensure undisrupted business, large Internet companies need to closely monitor various KPIs (e.g., page views, number of online users, and number of orders) of their Web applications to accurately detect anomalies and trigger timely troubleshooting/mitigation. However, anomaly detection for these seasonal KPIs, with their various patterns and data quality, has been a great challenge, especially without labels. In this paper, we propose Donut, an unsupervised anomaly detection algorithm based on the variational auto-encoder (VAE). Thanks to a few key techniques, Donut greatly outperforms a state-of-the-art supervised ensemble approach and a baseline VAE approach, and its best F-scores range from 0.75 to 0.9 for the studied KPIs from a top global Internet company. We also propose a novel kernel density estimation (KDE) interpretation of reconstruction for Donut, making it the first VAE-based anomaly detection algorithm with a solid theoretical explanation.
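A toy illustration of reconstruction-based scoring of the kind Donut uses: the anomaly score of an observation is its negative log-probability under the reconstruction distribution. In Donut the per-timestamp mean and standard deviation come from the VAE decoder; here they are supplied directly as stand-ins.

```python
import math

def gaussian_log_prob(x, mu, sigma):
    # log N(x; mu, sigma^2)
    return (-0.5 * math.log(2 * math.pi * sigma * sigma)
            - (x - mu) ** 2 / (2 * sigma * sigma))

def anomaly_score(x, mu, sigma):
    # Reconstruction-probability scoring: higher score = more anomalous.
    # In the real algorithm mu and sigma are decoder outputs; here they
    # are hypothetical fixed values for one timestamp.
    return -gaussian_log_prob(x, mu, sigma)

# An observation close to the expected value scores low ...
low = anomaly_score(10.2, mu=10.0, sigma=1.0)
# ... while one far from it scores high.
high = anomaly_score(17.0, mu=10.0, sigma=1.0)
print(low < high)  # True
```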
Modern service systems are constantly improving with the development of various IT technologies, leading to a boost in system scale and complex dependencies among service components. The large scale and complexity of services make them more prone to failure. To maintain services' normal and stable operation, alert and incident management (AIM), which analyzes and handles service failures in time, has become an important part of IT service management (ITSM). Many intelligent solutions have been proposed to improve the management process. However, there is currently no comprehensive survey that systematically reviews related works. Moreover, no integrated AIM architecture covers each detailed process or most existing piecemeal solutions. Therefore, we conduct an in-depth survey to address these problems. To the best of our knowledge, this paper is the most comprehensive survey on intelligent AIM in IT services. Through this survey, we make the following contributions. First, we summarize an integrated architecture that includes detailed AIM processes and key techniques. Second, we provide a systematic review of related works based on the architecture. Third, we give a valuable analysis of current challenges and trends in AIM.
Key performance indicator (KPI) anomaly detection (AD) is critical to ensuring service quality and reliability. Due to the effects of work days, off days, festivals, and business activities on user behavior, KPIs may exhibit different patterns on different days, which we call the periodicity profiles of KPIs. However, existing KPI AD approaches have difficulty adapting to diverse periodicity profiles due to a lack of generality. In this paper, we propose an automatic and generic framework called Period, which accurately detects periodicity profiles through daily-subsequence clustering and improves the performance of AD methods by robustly and automatically adapting to different periodicity profiles. In our evaluation using several real-world KPIs with different periodicity profiles from large Internet-based services, the clustering algorithm used to detect periodicity achieves about 0.95 accuracy on average. More importantly, further evaluation on 56 KPIs shows that Period can significantly improve the best F-score of several widely used AD approaches by up to 0.66.
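A minimal sketch of daily-subsequence clustering: split the KPI into per-day subsequences, z-normalize them so shape (not level) is compared, and group days whose profiles are close. The greedy single-representative clustering and the distance threshold are illustrative simplifications, not Period's actual algorithm.

```python
import math

def znorm(xs):
    # z-normalize a daily subsequence so clustering compares shape,
    # not absolute level.
    m = sum(xs) / len(xs)
    sd = math.sqrt(sum((x - m) ** 2 for x in xs) / len(xs)) or 1.0
    return [(x - m) / sd for x in xs]

def dist(a, b):
    # Euclidean distance between two z-normalized days.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(znorm(a), znorm(b))))

def cluster_days(days, threshold=2.0):
    # Greedy clustering: attach a day to the first cluster whose
    # representative (first member) is within the threshold.
    clusters = []
    for d in days:
        for c in clusters:
            if dist(c[0], d) <= threshold:
                c.append(d)
                break
        else:
            clusters.append([d])
    return clusters

workday1 = [1, 2, 5, 9, 5, 2]   # pronounced daytime peak
workday2 = [2, 3, 6, 10, 6, 3]  # same shape, shifted level
offday   = [4, 4, 5, 4, 4, 4]   # almost flat profile
clusters = cluster_days([workday1, workday2, offday])
print(len(clusters))  # 2: the two workdays together, the off day alone
```

The two workdays land in one cluster because their z-normalized shapes are identical even though their levels differ; each cluster would then get its own periodicity profile for downstream AD.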
In large-scale online service systems, to enhance the quality of services, engineers need to collect various monitoring data and write many rules to trigger alerts. However, the number of alerts far exceeds what on-call engineers can properly investigate. Thus, in practice, alerts are classified into several priority levels using manual rules, and on-call engineers primarily focus on handling the alerts with the highest priority level (i.e., severe alerts). Unfortunately, due to the complex and dynamic nature of online services, this rule-based approach results in missed severe alerts or wasted troubleshooting time on non-severe alerts. In this paper, we propose AlertRank, an automatic and adaptive framework for identifying severe alerts. Specifically, AlertRank extracts a set of powerful and interpretable features (textual and temporal alert features, and univariate and multivariate anomaly features for monitoring metrics), adopts the XGBoost ranking algorithm to identify severe alerts among all incoming alerts, and uses novel methods to obtain labels for both training and testing. Experiments on datasets from a top global commercial bank demonstrate that AlertRank is effective, achieving an F1-score of 0.89 on average and outperforming all baselines. Feedback from practice shows that AlertRank can significantly reduce the manual effort of on-call engineers.
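A minimal sketch of the ranking idea: score each alert from its features and sort so that on-call engineers see the most likely severe alerts first. The feature names and linear weights below are hypothetical stand-ins; AlertRank itself learns the scoring function with XGBoost's learning-to-rank.

```python
# Hypothetical features: a critical keyword flag (textual), a metric
# anomaly degree (univariate anomaly feature), and a recent repeat
# count (temporal feature). The weights are illustrative, not learned.
def severity_score(alert):
    w_keyword, w_anomaly, w_repeat = 2.0, 1.5, 0.5
    return (w_keyword * alert["has_critical_keyword"]
            + w_anomaly * alert["metric_anomaly_degree"]
            + w_repeat * alert["recent_repeat_count"])

alerts = [
    {"id": "a1", "has_critical_keyword": 1, "metric_anomaly_degree": 0.9,
     "recent_repeat_count": 3},
    {"id": "a2", "has_critical_keyword": 0, "metric_anomaly_degree": 0.2,
     "recent_repeat_count": 1},
    {"id": "a3", "has_critical_keyword": 1, "metric_anomaly_degree": 0.1,
     "recent_repeat_count": 0},
]
ranked = sorted(alerts, key=severity_score, reverse=True)
print([a["id"] for a in ranked])  # ['a1', 'a3', 'a2']
```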
KPI (key performance indicator) anomaly detection is critical for Internet-based services to ensure quality and reliability. However, the performance of existing algorithms in reality is far from satisfactory due to the lack of sufficient KPI anomaly data to help train and evaluate them. In this paper, we argue that labeling overhead is the main hurdle to obtaining such datasets. Thus, we propose a semi-automatic labeling tool called Label-Less, which minimizes labeling overhead in order to enable an ImageNet-like large-scale KPI anomaly dataset with high-quality ground truth. One novel technique in Label-Less is robust and rapid anomaly similarity search, which saves operators from scanning long KPIs back and forth to check for abnormal patterns or label consistency. In our evaluation using 30 real KPIs from a large Internet company, our anomaly similarity search achieves a best F-score of 0.95 on average and a real-time per-KPI response time (less than 0.5 seconds). Overall, feedback from deployment in practice shows that Label-Less can reduce operators' labeling overhead by more than 90%.
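A minimal sketch of anomaly similarity search: slide a labeled anomaly pattern over the KPI and return the offset of the most similar z-normalized subsequence, so similar anomalies can be surfaced without rescanning the whole series by eye. This is the naive O(n*m) version; Label-Less's actual search is accelerated for real-time response.

```python
import math

def znorm(xs):
    m = sum(xs) / len(xs)
    sd = math.sqrt(sum((x - m) ** 2 for x in xs) / len(xs)) or 1.0
    return [(x - m) / sd for x in xs]

def best_match(series, query):
    # Compare the z-normalized query against every subsequence of the
    # same length and keep the closest one.
    q, m = znorm(query), len(query)
    best_i, best_d = -1, float("inf")
    for i in range(len(series) - m + 1):
        w = znorm(series[i:i + m])
        d = math.sqrt(sum((a - b) ** 2 for a, b in zip(w, q)))
        if d < best_d:
            best_i, best_d = i, d
    return best_i

kpi = [5, 5, 5, 5, 9, 1, 5, 5, 5, 5, 9, 1, 5, 5]
spike_then_drop = [5, 9, 1]   # a previously labeled anomaly shape
print(best_match(kpi, spike_then_drop))  # 3 (first matching occurrence)
```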
Microservice architecture is adopted by an increasing number of systems because of its benefits for delivery, scalability, and autonomy. It is essential but challenging to localize root-cause microservices promptly when a fault occurs. Traces are helpful for root-cause microservice localization, and thus many recent approaches utilize them. However, these approaches are less practical because they rely on supervision or other unrealistic assumptions. To overcome their limitations, we propose a more practical root-cause microservice localization approach named TraceRCA. The key insight of TraceRCA is that a microservice with more abnormal and fewer normal traces passing through it is more likely to be the root cause. Based on this insight, TraceRCA is composed of trace anomaly detection, suspicious microservice set mining, and microservice ranking. We conducted experiments on hundreds of injected faults in a widely used open-source microservice benchmark and a production system. The results show that TraceRCA is effective in various situations: its top-1 accuracy outperforms the state-of-the-art unsupervised approaches by 44.8%. Besides, TraceRCA has been applied in a large commercial bank, where it helps operators localize root causes for real-world faults accurately and efficiently. We also share some lessons learned from our real-world deployment.
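The key insight can be sketched as a simple confidence-style ratio per microservice: the fraction of traces through it that are abnormal. The scoring below is a simplification for illustration; TraceRCA's actual set mining and ranking steps are more elaborate, and the service names are made up.

```python
def suspicious_scores(abnormal_traces, normal_traces):
    # Each trace is the set of microservices it passes through.
    services = {s for t in abnormal_traces + normal_traces for s in t}
    scores = {}
    for s in services:
        n_ab = sum(1 for t in abnormal_traces if s in t)
        n_no = sum(1 for t in normal_traces if s in t)
        # A service with many abnormal and few normal traces through
        # it gets a score close to 1.
        scores[s] = n_ab / (n_ab + n_no) if n_ab + n_no else 0.0
    return scores

abnormal = [["gateway", "orders", "db"],
            ["gateway", "orders", "db"],
            ["gateway", "orders"]]
normal = [["gateway", "users"],
          ["gateway", "users", "db"]]

scores = suspicious_scores(abnormal, normal)
top = max(scores, key=scores.get)
print(top)  # 'orders': on every abnormal trace, on no normal one
```

Note how "gateway" appears on every abnormal trace too, but its presence on normal traces lowers its score, which is exactly why counting normal traces matters.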
To ensure quality of service and user experience, large Internet companies often monitor various key performance indicators (KPIs) of their systems so that they can detect anomalies and identify failures in real time. However, due to the large number and variety of KPIs and the lack of high-quality labels, existing KPI anomaly detection approaches either perform well only on certain types of KPIs or consume excessive resources. Therefore, to realize generic and practical KPI anomaly detection in the real world, we propose a KPI anomaly detection framework named iRRCF-Active, which contains an unsupervised and white-box anomaly detector based on Robust Random Cut Forest (RRCF) and an active learning component. Specifically, we propose an improved RRCF (iRRCF) algorithm to overcome the drawbacks of applying the original RRCF to KPI anomaly detection. Besides, we incorporate the idea of active learning so that our model benefits from high-quality labels given by experienced operators. We conduct extensive experiments on a large-scale public dataset and a private dataset collected from a large commercial bank. The experimental results demonstrate that iRRCF-Active performs better than existing traditional statistical methods, unsupervised learning methods, and supervised learning methods. Besides, each component in iRRCF-Active has been demonstrated to be effective and indispensable.
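A toy illustration of the random-cut intuition behind RRCF-style detectors: points that random cuts isolate at shallow depth are anomalous. This 1-D isolation-depth sketch is a deliberately simplified stand-in, not the full RRCF tree with its collusive displacement (CoDisp) score.

```python
import random

def isolation_depth(points, target, rng, depth=0):
    # Repeatedly make a uniform random cut and keep only the points on
    # the target's side; the recursion depth at which the target is
    # alone is its isolation depth.
    if len(points) <= 1 or min(points) == max(points):
        return depth
    cut = rng.uniform(min(points), max(points))
    side = [p for p in points if (p <= cut) == (target <= cut)]
    return isolation_depth(side, target, rng, depth + 1)

def avg_depth(points, target, n_trees=200, seed=0):
    # Average over many random trees to smooth out cut randomness.
    rng = random.Random(seed)
    return sum(isolation_depth(points, target, rng)
               for _ in range(n_trees)) / n_trees

data = [10.0, 10.5, 11.0, 11.5, 12.0, 12.5, 13.0, 90.0]
outlier_depth = avg_depth(data, 90.0)
center_depth = avg_depth(data, 11.5)
print(outlier_depth < center_depth)  # the outlier isolates sooner
```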
Using large-scale multi-dimensional data for root cause analysis (MDRCA) is vitally important for online software services. It helps operators quickly narrow down the scope of anomalies and failures and localize the root cause at a finer granularity. However, most existing MDRCA algorithms can only solve low-dimensional problems. When dealing with high-dimensional data, the complexity of these algorithms increases significantly, and some algorithms no longer work at all. Intuitively, passing only a subset of attributes rather than all attributes can improve the performance of these MDRCA algorithms. However, this is challenging due to data imbalance and novel root cause attributes. To better understand the problem of root-cause-oriented attribute selection (RCOAS), we conduct a preliminary study based on real-world data. We find that several straightforward rules can filter out some attributes. In addition, we reveal that existing approaches do not fit the requirements of RCOAS. Motivated by the study, we propose an RCOAS approach, RC-LIR, to select a subset of attributes for downstream algorithms. RC-LIR first performs rule-based selection. It then improves a feature selection algorithm with two strategies, i.e., scaling up the imbalanced data and accounting for redundancy cost. Experiments on 1000 real-world fault cases demonstrate that RC-LIR achieves an F1-score of 0.88, outperforming the baseline approaches by at least 0.15. Furthermore, our experiments with four widely adopted MDRCA algorithms show that integrating RC-LIR leads to more effective and efficient MDRCA.
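A minimal sketch of the "scaling up imbalanced data" idea: fault cases are rare compared with normal records, so minority-class records are replicated before feature selection is run. The records, labels, and replication factor are illustrative; RC-LIR's actual strategies are more involved.

```python
def oversample(records, labels, factor):
    # Replicate each fault-labeled record `factor` times so the
    # downstream feature-selection step sees a balanced class mix.
    out_r, out_l = [], []
    for r, l in zip(records, labels):
        copies = factor if l == 1 else 1  # replicate fault cases only
        out_r.extend([r] * copies)
        out_l.extend([l] * copies)
    return out_r, out_l

records = [{"dc": "east"}, {"dc": "west"}, {"dc": "west"},
           {"dc": "east"}]           # 3 normal records, 1 fault case
labels = [0, 0, 0, 1]
balanced_r, balanced_l = oversample(records, labels, factor=3)
print(sum(balanced_l), len(balanced_l))  # 3 faults out of 6 records
```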
Alerts are a key data source in monitoring systems for online services: they record anomalies in service components and report them to engineers. In general, the occurrence of a service failure tends to be accompanied by a large number of alerts, which is called an alert storm. However, alert storms make failure diagnosis harder, because it is time-consuming and tedious for engineers to investigate such an overwhelming number of alerts manually. To help understand alert storms in practice, we conduct the first empirical study of alert storms based on large-scale real-world alert data and gain some valuable insights. Based on the findings obtained from the study, we propose a novel approach to handling alert storms. Specifically, this approach includes alert storm detection, which aims to identify alert storms accurately, and alert storm summarization, which aims to recommend a small set of representative alerts to engineers for failure diagnosis. Our experimental study on a real-world dataset demonstrates that our alert storm detection achieves a high F1-score (larger than 0.9). Besides, our alert storm summarization can reduce the number of alerts that need to be examined by more than 98% and discover representative alerts accurately. We have successfully applied our approach to the service maintenance of a large commercial bank (China Everbright Bank), and we also share our success stories and lessons learned in industry.
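A minimal sketch of alert storm detection: count alerts in fixed time windows and flag any window whose count far exceeds the historical mean. The mean-plus-k-standard-deviations rule and the data are illustrative; the paper's detection method is grounded in its empirical study of real alert storms.

```python
import math

def detect_storms(window_counts, k=2.0):
    # Flag windows whose alert count exceeds mean + k * std over the
    # observed history. A real deployment would use a rolling baseline
    # rather than the whole series.
    mean = sum(window_counts) / len(window_counts)
    std = math.sqrt(sum((c - mean) ** 2 for c in window_counts)
                    / len(window_counts))
    return [i for i, c in enumerate(window_counts) if c > mean + k * std]

# Alerts per 5-minute window; window 6 holds a storm.
counts = [4, 6, 5, 7, 5, 6, 120, 5, 4, 6]
print(detect_storms(counts))  # [6]
```

The summarization step would then deduplicate the alerts inside the flagged window down to a few representatives, which is where the 98% reduction comes from.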