ESRO: Experience Assisted Service Reliability against Outages
Main authors: | , , , , , , |
---|---|
Format: | Article |
Language: | eng |
Subjects: | |
Online access: | Order full text |
Summary: | Modern cloud services are prone to failures due to their complex
architecture, making diagnosis a critical process. Site Reliability Engineers
(SREs) spend hours leveraging multiple sources of data, including alerts,
error logs, and domain expertise gained from past experiences, to locate the root
cause(s). These experiences are documented as natural language text in outage
reports for previous outages. However, utilizing the raw yet rich
semi-structured information in the reports systematically is time-consuming.
Structured information, on the other hand, such as alerts that are often used
during fault diagnosis, is voluminous and requires expert knowledge to discern.
Several strategies have been proposed to use each source of data separately for
root cause analysis. In this work, we build a diagnostic service called ESRO
that recommends root causes and remediation for failures by utilizing
structured as well as semi-structured sources of data systematically. ESRO
constructs a causal graph using alerts and a knowledge graph using outage
reports, and merges them in a novel way to form a unified graph during
training. A retrieval-based mechanism is then used to search the unified graph
and rank the likely root causes and remediation techniques based on the alerts
fired during an outage at inference time. The recommendation takes into account
not only the individual alerts but also their respective importance in
predicting an outage group. We evaluated our model on several cloud service
outages of a large SaaS enterprise over the course of ~2 years and, comparing
the likely root causes against the ground truth, obtained an average improvement
of 27% in ROUGE scores over state-of-the-art baselines. We further establish
the effectiveness of ESRO through qualitative analysis on multiple real outage
examples. |
---|---|
DOI: | 10.48550/arxiv.2309.07230 |