Demystifying and Extracting Fault-indicating Information from Logs for Failure Diagnosis
Main Authors: | , , , , , , , , , |
---|---|
Format: | Article |
Language: | eng |
Subjects: | |
Online Access: | Order full text |
Summary: | Logs are imperative in the maintenance of online service systems and often
encompass important information for effective failure mitigation. While
existing anomaly detection methodologies facilitate the identification of
anomalous logs within extensive runtime data, manual investigation of log
messages by engineers remains essential to comprehend faults, which is
labor-intensive and error-prone. Upon examining the log-based troubleshooting
practices at CloudA, we find that engineers typically prioritize two categories
of log information for diagnosis. These include fault-indicating descriptions,
which record abnormal system events, and fault-indicating parameters, which
specify the associated entities. Motivated by this finding, we propose an
approach to automatically extract such fault-indicating information from logs
for fault diagnosis, named LoFI. LoFI comprises two key stages. In the first
stage, LoFI performs coarse-grained filtering to collect logs related to the
faults based on semantic similarity. In the second stage, LoFI leverages a
pre-trained language model with a novel prompt-based tuning method to extract
fine-grained information of interest from the collected logs. We evaluate LoFI
on logs collected from Apache Spark and an industrial dataset from CloudA. The
experimental results demonstrate that LoFI outperforms all baseline methods by
a significant margin, achieving an absolute improvement of 25.8 to 37.9 in F1 over
the best baseline method, ChatGPT. This highlights the effectiveness of LoFI in
recognizing fault-indicating information. Furthermore, the successful
deployment of LoFI at CloudA and user studies validate the utility of our
method. The code and data are available at
https://github.com/Jun-jie-Huang/LoFI. |
---|---|
DOI: | 10.48550/arxiv.2409.13561 |
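
The first stage described in the summary, coarse-grained filtering of logs by semantic similarity, can be illustrated with a minimal sketch. This is not the LoFI implementation: it swaps in a plain bag-of-words cosine similarity for whatever semantic representation the authors use, and the function names (`tokenize`, `cosine`, `filter_logs`), the `threshold` value, and the sample log lines are all hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two token-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    norm = norm_a * norm_b
    return dot / norm if norm else 0.0


def filter_logs(logs, fault_description, threshold=0.2):
    """Keep log lines whose similarity to the fault description
    meets the (assumed, illustrative) threshold."""
    query = Counter(tokenize(fault_description))
    return [
        line
        for line in logs
        if cosine(Counter(tokenize(line)), query) >= threshold
    ]


# Hypothetical log lines, for illustration only.
sample_logs = [
    "INFO heartbeat received from worker 3",
    "ERROR connection refused to host 10.0.0.5 port 7077",
    "INFO task 12 finished",
]
related = filter_logs(sample_logs, "connection refused to host")
```

In this sketch only the second sample line shares tokens with the fault description, so it is the only one retained. A real system would likely replace the token-count vectors with embeddings from a language model, since word overlap misses paraphrases.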