Knowledge-aware Alert Aggregation in Large-scale Cloud Systems: a Hybrid Approach

Due to the scale and complexity of cloud systems, a system failure would trigger an "alert storm", i.e., massive correlated alerts. Although these alerts can be traced back to a few root causes, the overwhelming number makes it infeasible for manual handling. Alert aggregation is thus crit...

Bibliographic Details
Published in: arXiv.org 2024-03
Main Authors: Kuang, Jinxi; Liu, Jinyang; Huang, Junjie; Zhong, Renyi; Gu, Jiazhen; Yu, Lan; Tan, Rui; Yang, Zengyin; Lyu, Michael R
Format: Article
Language: eng
Subjects:
Online Access: Full text
container_end_page
container_issue
container_start_page
container_title arXiv.org
container_volume
creator Kuang, Jinxi
Liu, Jinyang
Huang, Junjie
Zhong, Renyi
Gu, Jiazhen
Yu, Lan
Tan, Rui
Yang, Zengyin
Lyu, Michael R
description Due to the scale and complexity of cloud systems, a system failure would trigger an "alert storm", i.e., massive correlated alerts. Although these alerts can be traced back to a few root causes, the overwhelming number makes it infeasible for manual handling. Alert aggregation is thus critical to help engineers concentrate on the root cause and facilitate failure resolution. Existing methods typically utilize semantic similarity-based methods or statistical methods to aggregate alerts. However, semantic similarity-based methods overlook the causal rationale of alerts, while statistical methods can hardly handle infrequent alerts. To tackle these limitations, we introduce leveraging external knowledge, i.e., Standard Operation Procedure (SOP) of alerts as a supplement. We propose COLA, a novel hybrid approach based on correlation mining and LLM (Large Language Model) reasoning for online alert aggregation. The correlation mining module effectively captures the temporal and spatial relations between alerts, measuring their correlations in an efficient manner. Subsequently, only uncertain pairs with low confidence are forwarded to the LLM reasoning module for detailed analysis. This hybrid design harnesses both statistical evidence for frequent alerts and the reasoning capabilities of computationally intensive LLMs, ensuring the overall efficiency of COLA in handling large volumes of alerts in practical scenarios. We evaluate COLA on three datasets collected from the production environment of a large-scale cloud platform. The experimental results show COLA achieves F1-scores from 0.901 to 0.930, outperforming state-of-the-art methods and achieving comparable efficiency. We also share our experience in deploying COLA in our real-world cloud system, Cloud X.
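The routing idea in the description (statistical evidence decides the clear cases, and only uncertain pairs reach the LLM) can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `aggregate_pairs`, the `low` and `high` thresholds, the score range of 0 to 1, and the `llm_judge` callback are all hypothetical and chosen only for illustration.

```python
from typing import Callable, Iterable, List, Tuple

# An alert pair is identified here by two alert names (an assumption for this sketch).
Pair = Tuple[str, str]


def aggregate_pairs(
    scored_pairs: Iterable[Tuple[Pair, float]],
    llm_judge: Callable[[Pair], bool],
    low: float = 0.3,   # hypothetical threshold: at or below this, treat as unrelated
    high: float = 0.7,  # hypothetical threshold: at or above this, treat as related
) -> List[Tuple[Pair, bool, str]]:
    """Decide for each alert pair whether it should be aggregated.

    Pairs with a clearly high or clearly low correlation score are decided
    on the score alone. Pairs whose score falls strictly between the two
    thresholds are considered uncertain and are passed to llm_judge.
    Each result records the pair, the decision, and which stage made it.
    """
    decisions: List[Tuple[Pair, bool, str]] = []
    for pair, score in scored_pairs:
        if score >= high:
            decisions.append((pair, True, "correlation"))
        elif score <= low:
            decisions.append((pair, False, "correlation"))
        else:
            # Only the uncertain middle band incurs the cost of an LLM call.
            decisions.append((pair, llm_judge(pair), "llm"))
    return decisions
```

In a sketch like this, `llm_judge` would wrap whatever model call and SOP lookup a deployment uses; passing a stub that records its inputs makes it easy to check that only the middle band of scores reaches the expensive stage.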
doi_str_mv 10.48550/arxiv.2403.06485
format Article
fullrecord <record><control><sourceid>proquest_arxiv</sourceid><recordid>TN_cdi_arxiv_primary_2403_06485</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><sourcerecordid>2955958394</sourcerecordid><originalsourceid>FETCH-LOGICAL-a524-4fa68974b55383ffb99f1b3961d626c2ae786229b3f3b5103f60c340e48566bd3</originalsourceid><addsrcrecordid>eNotj1FLwzAURoMgOOZ-gE8GfO5Mc5Os8a0MdWJBxL2XmzapHV07k865f2_cfLpwOXycQ8hNyuYik5Ldo_9pv-dcMJgzFV8XZMIB0iQTnF-RWQgbxhhXCy4lTMj7az8cOls3NsEDekvzzvqR5k3jbYNjO_S07WmBPgKhws7SZTfsa_pxDKPdhgeKdHU0vq1pvtv5AavPa3LpsAt29n-nZP30uF6ukuLt-WWZFwlKLhLhUGV6IUy0yMA5o7VLDWiV1oqriqNdZIpzbcCBkSkDp1gFgtlYpJSpYUpuz7On3nLn2y36Y_nXXZ66I3F3JqLX196GsdwMe99Hp5JrKbXMQAv4Ben6WY0</addsrcrecordid><sourcetype>Open Access Repository</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>2955958394</pqid></control><display><type>article</type><title>Knowledge-aware Alert Aggregation in Large-scale Cloud Systems: a Hybrid Approach</title><source>arXiv.org</source><source>Free E- Journals</source><creator>Kuang, Jinxi ; Liu, Jinyang ; Huang, Junjie ; Zhong, Renyi ; Gu, Jiazhen ; Yu, Lan ; Tan, Rui ; Yang, Zengyin ; Lyu, Michael R</creator><creatorcontrib>Kuang, Jinxi ; Liu, Jinyang ; Huang, Junjie ; Zhong, Renyi ; Gu, Jiazhen ; Yu, Lan ; Tan, Rui ; Yang, Zengyin ; Lyu, Michael R</creatorcontrib><description>Due to the scale and complexity of cloud systems, a system failure would trigger an "alert storm", i.e., massive correlated alerts. Although these alerts can be traced back to a few root causes, the overwhelming number makes it infeasible for manual handling. Alert aggregation is thus critical to help engineers concentrate on the root cause and facilitate failure resolution. Existing methods typically utilize semantic similarity-based methods or statistical methods to aggregate alerts. However, semantic similarity-based methods overlook the causal rationale of alerts, while statistical methods can hardly handle infrequent alerts. 
To tackle these limitations, we introduce leveraging external knowledge, i.e., Standard Operation Procedure (SOP) of alerts as a supplement. We propose COLA, a novel hybrid approach based on correlation mining and LLM (Large Language Model) reasoning for online alert aggregation. The correlation mining module effectively captures the temporal and spatial relations between alerts, measuring their correlations in an efficient manner. Subsequently, only uncertain pairs with low confidence are forwarded to the LLM reasoning module for detailed analysis. This hybrid design harnesses both statistical evidence for frequent alerts and the reasoning capabilities of computationally intensive LLMs, ensuring the overall efficiency of COLA in handling large volumes of alerts in practical scenarios. We evaluate COLA on three datasets collected from the production environment of a large-scale cloud platform. The experimental results show COLA achieves F1-scores from 0.901 to 0.930, outperforming state-of-the-art methods and achieving comparable efficiency. We also share our experience in deploying COLA in our real-world cloud system, Cloud X.</description><identifier>EISSN: 2331-8422</identifier><identifier>DOI: 10.48550/arxiv.2403.06485</identifier><language>eng</language><publisher>Ithaca: Cornell University Library, arXiv.org</publisher><subject>Cloud computing ; Computer Science - Computation and Language ; Computer Science - Learning ; Computer Science - Software Engineering ; Correlation ; Harnesses ; Hybrid systems ; Large language models ; Modules ; Reasoning ; Root cause analysis ; Semantics ; Similarity ; Statistical methods</subject><ispartof>arXiv.org, 2024-03</ispartof><rights>2024. This work is published under http://arxiv.org/licenses/nonexclusive-distrib/1.0/ (the “License”). 
Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.</rights><rights>http://arxiv.org/licenses/nonexclusive-distrib/1.0</rights><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><link.rule.ids>228,230,778,782,883,27912</link.rule.ids><backlink>$$Uhttps://doi.org/10.1145/3639477.3639745$$DView published paper (Access to full text may be restricted)$$Hfree_for_read</backlink><backlink>$$Uhttps://doi.org/10.48550/arXiv.2403.06485$$DView paper in arXiv$$Hfree_for_read</backlink></links><search><creatorcontrib>Kuang, Jinxi</creatorcontrib><creatorcontrib>Liu, Jinyang</creatorcontrib><creatorcontrib>Huang, Junjie</creatorcontrib><creatorcontrib>Zhong, Renyi</creatorcontrib><creatorcontrib>Gu, Jiazhen</creatorcontrib><creatorcontrib>Yu, Lan</creatorcontrib><creatorcontrib>Tan, Rui</creatorcontrib><creatorcontrib>Yang, Zengyin</creatorcontrib><creatorcontrib>Lyu, Michael R</creatorcontrib><title>Knowledge-aware Alert Aggregation in Large-scale Cloud Systems: a Hybrid Approach</title><title>arXiv.org</title><description>Due to the scale and complexity of cloud systems, a system failure would trigger an "alert storm", i.e., massive correlated alerts. Although these alerts can be traced back to a few root causes, the overwhelming number makes it infeasible for manual handling. Alert aggregation is thus critical to help engineers concentrate on the root cause and facilitate failure resolution. Existing methods typically utilize semantic similarity-based methods or statistical methods to aggregate alerts. However, semantic similarity-based methods overlook the causal rationale of alerts, while statistical methods can hardly handle infrequent alerts. 
To tackle these limitations, we introduce leveraging external knowledge, i.e., Standard Operation Procedure (SOP) of alerts as a supplement. We propose COLA, a novel hybrid approach based on correlation mining and LLM (Large Language Model) reasoning for online alert aggregation. The correlation mining module effectively captures the temporal and spatial relations between alerts, measuring their correlations in an efficient manner. Subsequently, only uncertain pairs with low confidence are forwarded to the LLM reasoning module for detailed analysis. This hybrid design harnesses both statistical evidence for frequent alerts and the reasoning capabilities of computationally intensive LLMs, ensuring the overall efficiency of COLA in handling large volumes of alerts in practical scenarios. We evaluate COLA on three datasets collected from the production environment of a large-scale cloud platform. The experimental results show COLA achieves F1-scores from 0.901 to 0.930, outperforming state-of-the-art methods and achieving comparable efficiency. 
We also share our experience in deploying COLA in our real-world cloud system, Cloud X.</description><subject>Cloud computing</subject><subject>Computer Science - Computation and Language</subject><subject>Computer Science - Learning</subject><subject>Computer Science - Software Engineering</subject><subject>Correlation</subject><subject>Harnesses</subject><subject>Hybrid systems</subject><subject>Large language models</subject><subject>Modules</subject><subject>Reasoning</subject><subject>Root cause analysis</subject><subject>Semantics</subject><subject>Similarity</subject><subject>Statistical methods</subject><issn>2331-8422</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2024</creationdate><recordtype>article</recordtype><sourceid>ABUWG</sourceid><sourceid>AFKRA</sourceid><sourceid>AZQEC</sourceid><sourceid>BENPR</sourceid><sourceid>CCPQU</sourceid><sourceid>DWQXO</sourceid><sourceid>GOX</sourceid><recordid>eNotj1FLwzAURoMgOOZ-gE8GfO5Mc5Os8a0MdWJBxL2XmzapHV07k865f2_cfLpwOXycQ8hNyuYik5Ldo_9pv-dcMJgzFV8XZMIB0iQTnF-RWQgbxhhXCy4lTMj7az8cOls3NsEDekvzzvqR5k3jbYNjO_S07WmBPgKhws7SZTfsa_pxDKPdhgeKdHU0vq1pvtv5AavPa3LpsAt29n-nZP30uF6ukuLt-WWZFwlKLhLhUGV6IUy0yMA5o7VLDWiV1oqriqNdZIpzbcCBkSkDp1gFgtlYpJSpYUpuz7On3nLn2y36Y_nXXZ66I3F3JqLX196GsdwMe99Hp5JrKbXMQAv4Ben6WY0</recordid><startdate>20240311</startdate><enddate>20240311</enddate><creator>Kuang, Jinxi</creator><creator>Liu, Jinyang</creator><creator>Huang, Junjie</creator><creator>Zhong, Renyi</creator><creator>Gu, Jiazhen</creator><creator>Yu, Lan</creator><creator>Tan, Rui</creator><creator>Yang, Zengyin</creator><creator>Lyu, Michael R</creator><general>Cornell University Library, 
arXiv.org</general><scope>8FE</scope><scope>8FG</scope><scope>ABJCF</scope><scope>ABUWG</scope><scope>AFKRA</scope><scope>AZQEC</scope><scope>BENPR</scope><scope>BGLVJ</scope><scope>CCPQU</scope><scope>DWQXO</scope><scope>HCIFZ</scope><scope>L6V</scope><scope>M7S</scope><scope>PIMPY</scope><scope>PQEST</scope><scope>PQQKQ</scope><scope>PQUKI</scope><scope>PRINS</scope><scope>PTHSS</scope><scope>AKY</scope><scope>GOX</scope></search><sort><creationdate>20240311</creationdate><title>Knowledge-aware Alert Aggregation in Large-scale Cloud Systems: a Hybrid Approach</title><author>Kuang, Jinxi ; Liu, Jinyang ; Huang, Junjie ; Zhong, Renyi ; Gu, Jiazhen ; Yu, Lan ; Tan, Rui ; Yang, Zengyin ; Lyu, Michael R</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-a524-4fa68974b55383ffb99f1b3961d626c2ae786229b3f3b5103f60c340e48566bd3</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2024</creationdate><topic>Cloud computing</topic><topic>Computer Science - Computation and Language</topic><topic>Computer Science - Learning</topic><topic>Computer Science - Software Engineering</topic><topic>Correlation</topic><topic>Harnesses</topic><topic>Hybrid systems</topic><topic>Large language models</topic><topic>Modules</topic><topic>Reasoning</topic><topic>Root cause analysis</topic><topic>Semantics</topic><topic>Similarity</topic><topic>Statistical methods</topic><toplevel>online_resources</toplevel><creatorcontrib>Kuang, Jinxi</creatorcontrib><creatorcontrib>Liu, Jinyang</creatorcontrib><creatorcontrib>Huang, Junjie</creatorcontrib><creatorcontrib>Zhong, Renyi</creatorcontrib><creatorcontrib>Gu, Jiazhen</creatorcontrib><creatorcontrib>Yu, Lan</creatorcontrib><creatorcontrib>Tan, Rui</creatorcontrib><creatorcontrib>Yang, Zengyin</creatorcontrib><creatorcontrib>Lyu, Michael R</creatorcontrib><collection>ProQuest SciTech Collection</collection><collection>ProQuest Technology 
Collection</collection><collection>Materials Science &amp; Engineering Collection</collection><collection>ProQuest Central (Alumni Edition)</collection><collection>ProQuest Central UK/Ireland</collection><collection>ProQuest Central Essentials</collection><collection>ProQuest Central</collection><collection>Technology Collection</collection><collection>ProQuest One Community College</collection><collection>ProQuest Central Korea</collection><collection>SciTech Premium Collection</collection><collection>ProQuest Engineering Collection</collection><collection>Engineering Database</collection><collection>Publicly Available Content Database</collection><collection>ProQuest One Academic Eastern Edition (DO NOT USE)</collection><collection>ProQuest One Academic</collection><collection>ProQuest One Academic UKI Edition</collection><collection>ProQuest Central China</collection><collection>Engineering Collection</collection><collection>arXiv Computer Science</collection><collection>arXiv.org</collection><jtitle>arXiv.org</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Kuang, Jinxi</au><au>Liu, Jinyang</au><au>Huang, Junjie</au><au>Zhong, Renyi</au><au>Gu, Jiazhen</au><au>Yu, Lan</au><au>Tan, Rui</au><au>Yang, Zengyin</au><au>Lyu, Michael R</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>Knowledge-aware Alert Aggregation in Large-scale Cloud Systems: a Hybrid Approach</atitle><jtitle>arXiv.org</jtitle><date>2024-03-11</date><risdate>2024</risdate><eissn>2331-8422</eissn><abstract>Due to the scale and complexity of cloud systems, a system failure would trigger an "alert storm", i.e., massive correlated alerts. Although these alerts can be traced back to a few root causes, the overwhelming number makes it infeasible for manual handling. Alert aggregation is thus critical to help engineers concentrate on the root cause and facilitate failure resolution. 
Existing methods typically utilize semantic similarity-based methods or statistical methods to aggregate alerts. However, semantic similarity-based methods overlook the causal rationale of alerts, while statistical methods can hardly handle infrequent alerts. To tackle these limitations, we introduce leveraging external knowledge, i.e., Standard Operation Procedure (SOP) of alerts as a supplement. We propose COLA, a novel hybrid approach based on correlation mining and LLM (Large Language Model) reasoning for online alert aggregation. The correlation mining module effectively captures the temporal and spatial relations between alerts, measuring their correlations in an efficient manner. Subsequently, only uncertain pairs with low confidence are forwarded to the LLM reasoning module for detailed analysis. This hybrid design harnesses both statistical evidence for frequent alerts and the reasoning capabilities of computationally intensive LLMs, ensuring the overall efficiency of COLA in handling large volumes of alerts in practical scenarios. We evaluate COLA on three datasets collected from the production environment of a large-scale cloud platform. The experimental results show COLA achieves F1-scores from 0.901 to 0.930, outperforming state-of-the-art methods and achieving comparable efficiency. We also share our experience in deploying COLA in our real-world cloud system, Cloud X.</abstract><cop>Ithaca</cop><pub>Cornell University Library, arXiv.org</pub><doi>10.48550/arxiv.2403.06485</doi><oa>free_for_read</oa></addata></record>
fulltext fulltext
identifier EISSN: 2331-8422
ispartof arXiv.org, 2024-03
issn 2331-8422
language eng
recordid cdi_arxiv_primary_2403_06485
source arXiv.org; Free E- Journals
subjects Cloud computing
Computer Science - Computation and Language
Computer Science - Learning
Computer Science - Software Engineering
Correlation
Harnesses
Hybrid systems
Large language models
Modules
Reasoning
Root cause analysis
Semantics
Similarity
Statistical methods
title Knowledge-aware Alert Aggregation in Large-scale Cloud Systems: a Hybrid Approach
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-15T17%3A25%3A17IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_arxiv&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Knowledge-aware%20Alert%20Aggregation%20in%20Large-scale%20Cloud%20Systems:%20a%20Hybrid%20Approach&rft.jtitle=arXiv.org&rft.au=Kuang,%20Jinxi&rft.date=2024-03-11&rft.eissn=2331-8422&rft_id=info:doi/10.48550/arxiv.2403.06485&rft_dat=%3Cproquest_arxiv%3E2955958394%3C/proquest_arxiv%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2955958394&rft_id=info:pmid/&rfr_iscdi=true