Knowledge-aware Alert Aggregation in Large-scale Cloud Systems: a Hybrid Approach
Due to the scale and complexity of cloud systems, a system failure would trigger an "alert storm", i.e., massive correlated alerts. Although these alerts can be traced back to a few root causes, the overwhelming number makes it infeasible for manual handling. Alert aggregation is thus crit...
Published in: | arXiv.org 2024-03 |
---|---|
Main authors: | Kuang, Jinxi; Liu, Jinyang; Huang, Junjie; Zhong, Renyi; Gu, Jiazhen; Yu, Lan; Tan, Rui; Yang, Zengyin; Lyu, Michael R |
Format: | Article |
Language: | eng |
Subjects: | |
Online access: | Full text |
container_end_page | |
---|---|
container_issue | |
container_start_page | |
container_title | arXiv.org |
container_volume | |
creator | Kuang, Jinxi Liu, Jinyang Huang, Junjie Zhong, Renyi Gu, Jiazhen Yu, Lan Tan, Rui Yang, Zengyin Lyu, Michael R |
description | Due to the scale and complexity of cloud systems, a system failure would trigger an "alert storm", i.e., massive correlated alerts. Although these alerts can be traced back to a few root causes, the overwhelming number makes it infeasible for manual handling. Alert aggregation is thus critical to help engineers concentrate on the root cause and facilitate failure resolution. Existing methods typically utilize semantic similarity-based methods or statistical methods to aggregate alerts. However, semantic similarity-based methods overlook the causal rationale of alerts, while statistical methods can hardly handle infrequent alerts. To tackle these limitations, we introduce leveraging external knowledge, i.e., Standard Operation Procedure (SOP) of alerts as a supplement. We propose COLA, a novel hybrid approach based on correlation mining and LLM (Large Language Model) reasoning for online alert aggregation. The correlation mining module effectively captures the temporal and spatial relations between alerts, measuring their correlations in an efficient manner. Subsequently, only uncertain pairs with low confidence are forwarded to the LLM reasoning module for detailed analysis. This hybrid design harnesses both statistical evidence for frequent alerts and the reasoning capabilities of computationally intensive LLMs, ensuring the overall efficiency of COLA in handling large volumes of alerts in practical scenarios. We evaluate COLA on three datasets collected from the production environment of a large-scale cloud platform. The experimental results show COLA achieves F1-scores from 0.901 to 0.930, outperforming state-of-the-art methods and achieving comparable efficiency. We also share our experience in deploying COLA in our real-world cloud system, Cloud X. |
doi_str_mv | 10.48550/arxiv.2403.06485 |
format | Article |
fullrecord | <record><control><sourceid>proquest_arxiv</sourceid><recordid>TN_cdi_arxiv_primary_2403_06485</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><sourcerecordid>2955958394</sourcerecordid><originalsourceid>FETCH-LOGICAL-a524-4fa68974b55383ffb99f1b3961d626c2ae786229b3f3b5103f60c340e48566bd3</originalsourceid><addsrcrecordid>eNotj1FLwzAURoMgOOZ-gE8GfO5Mc5Os8a0MdWJBxL2XmzapHV07k865f2_cfLpwOXycQ8hNyuYik5Ldo_9pv-dcMJgzFV8XZMIB0iQTnF-RWQgbxhhXCy4lTMj7az8cOls3NsEDekvzzvqR5k3jbYNjO_S07WmBPgKhws7SZTfsa_pxDKPdhgeKdHU0vq1pvtv5AavPa3LpsAt29n-nZP30uF6ukuLt-WWZFwlKLhLhUGV6IUy0yMA5o7VLDWiV1oqriqNdZIpzbcCBkSkDp1gFgtlYpJSpYUpuz7On3nLn2y36Y_nXXZ66I3F3JqLX196GsdwMe99Hp5JrKbXMQAv4Ben6WY0</addsrcrecordid><sourcetype>Open Access Repository</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>2955958394</pqid></control><display><type>article</type><title>Knowledge-aware Alert Aggregation in Large-scale Cloud Systems: a Hybrid Approach</title><source>arXiv.org</source><source>Free E- Journals</source><creator>Kuang, Jinxi ; Liu, Jinyang ; Huang, Junjie ; Zhong, Renyi ; Gu, Jiazhen ; Yu, Lan ; Tan, Rui ; Yang, Zengyin ; Lyu, Michael R</creator><creatorcontrib>Kuang, Jinxi ; Liu, Jinyang ; Huang, Junjie ; Zhong, Renyi ; Gu, Jiazhen ; Yu, Lan ; Tan, Rui ; Yang, Zengyin ; Lyu, Michael R</creatorcontrib><description>Due to the scale and complexity of cloud systems, a system failure would trigger an "alert storm", i.e., massive correlated alerts. Although these alerts can be traced back to a few root causes, the overwhelming number makes it infeasible for manual handling. Alert aggregation is thus critical to help engineers concentrate on the root cause and facilitate failure resolution. Existing methods typically utilize semantic similarity-based methods or statistical methods to aggregate alerts. However, semantic similarity-based methods overlook the causal rationale of alerts, while statistical methods can hardly handle infrequent alerts. 
To tackle these limitations, we introduce leveraging external knowledge, i.e., Standard Operation Procedure (SOP) of alerts as a supplement. We propose COLA, a novel hybrid approach based on correlation mining and LLM (Large Language Model) reasoning for online alert aggregation. The correlation mining module effectively captures the temporal and spatial relations between alerts, measuring their correlations in an efficient manner. Subsequently, only uncertain pairs with low confidence are forwarded to the LLM reasoning module for detailed analysis. This hybrid design harnesses both statistical evidence for frequent alerts and the reasoning capabilities of computationally intensive LLMs, ensuring the overall efficiency of COLA in handling large volumes of alerts in practical scenarios. We evaluate COLA on three datasets collected from the production environment of a large-scale cloud platform. The experimental results show COLA achieves F1-scores from 0.901 to 0.930, outperforming state-of-the-art methods and achieving comparable efficiency. We also share our experience in deploying COLA in our real-world cloud system, Cloud X.</description><identifier>EISSN: 2331-8422</identifier><identifier>DOI: 10.48550/arxiv.2403.06485</identifier><language>eng</language><publisher>Ithaca: Cornell University Library, arXiv.org</publisher><subject>Cloud computing ; Computer Science - Computation and Language ; Computer Science - Learning ; Computer Science - Software Engineering ; Correlation ; Harnesses ; Hybrid systems ; Large language models ; Modules ; Reasoning ; Root cause analysis ; Semantics ; Similarity ; Statistical methods</subject><ispartof>arXiv.org, 2024-03</ispartof><rights>2024. This work is published under http://arxiv.org/licenses/nonexclusive-distrib/1.0/ (the “License”). 
Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.</rights><rights>http://arxiv.org/licenses/nonexclusive-distrib/1.0</rights><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><link.rule.ids>228,230,778,782,883,27912</link.rule.ids><backlink>$$Uhttps://doi.org/10.1145/3639477.3639745$$DView published paper (Access to full text may be restricted)$$Hfree_for_read</backlink><backlink>$$Uhttps://doi.org/10.48550/arXiv.2403.06485$$DView paper in arXiv$$Hfree_for_read</backlink></links><search><creatorcontrib>Kuang, Jinxi</creatorcontrib><creatorcontrib>Liu, Jinyang</creatorcontrib><creatorcontrib>Huang, Junjie</creatorcontrib><creatorcontrib>Zhong, Renyi</creatorcontrib><creatorcontrib>Gu, Jiazhen</creatorcontrib><creatorcontrib>Yu, Lan</creatorcontrib><creatorcontrib>Tan, Rui</creatorcontrib><creatorcontrib>Yang, Zengyin</creatorcontrib><creatorcontrib>Lyu, Michael R</creatorcontrib><title>Knowledge-aware Alert Aggregation in Large-scale Cloud Systems: a Hybrid Approach</title><title>arXiv.org</title><description>Due to the scale and complexity of cloud systems, a system failure would trigger an "alert storm", i.e., massive correlated alerts. Although these alerts can be traced back to a few root causes, the overwhelming number makes it infeasible for manual handling. Alert aggregation is thus critical to help engineers concentrate on the root cause and facilitate failure resolution. Existing methods typically utilize semantic similarity-based methods or statistical methods to aggregate alerts. However, semantic similarity-based methods overlook the causal rationale of alerts, while statistical methods can hardly handle infrequent alerts. 
To tackle these limitations, we introduce leveraging external knowledge, i.e., Standard Operation Procedure (SOP) of alerts as a supplement. We propose COLA, a novel hybrid approach based on correlation mining and LLM (Large Language Model) reasoning for online alert aggregation. The correlation mining module effectively captures the temporal and spatial relations between alerts, measuring their correlations in an efficient manner. Subsequently, only uncertain pairs with low confidence are forwarded to the LLM reasoning module for detailed analysis. This hybrid design harnesses both statistical evidence for frequent alerts and the reasoning capabilities of computationally intensive LLMs, ensuring the overall efficiency of COLA in handling large volumes of alerts in practical scenarios. We evaluate COLA on three datasets collected from the production environment of a large-scale cloud platform. The experimental results show COLA achieves F1-scores from 0.901 to 0.930, outperforming state-of-the-art methods and achieving comparable efficiency. 
We also share our experience in deploying COLA in our real-world cloud system, Cloud X.</description><subject>Cloud computing</subject><subject>Computer Science - Computation and Language</subject><subject>Computer Science - Learning</subject><subject>Computer Science - Software Engineering</subject><subject>Correlation</subject><subject>Harnesses</subject><subject>Hybrid systems</subject><subject>Large language models</subject><subject>Modules</subject><subject>Reasoning</subject><subject>Root cause analysis</subject><subject>Semantics</subject><subject>Similarity</subject><subject>Statistical methods</subject><issn>2331-8422</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2024</creationdate><recordtype>article</recordtype><sourceid>ABUWG</sourceid><sourceid>AFKRA</sourceid><sourceid>AZQEC</sourceid><sourceid>BENPR</sourceid><sourceid>CCPQU</sourceid><sourceid>DWQXO</sourceid><sourceid>GOX</sourceid><recordid>eNotj1FLwzAURoMgOOZ-gE8GfO5Mc5Os8a0MdWJBxL2XmzapHV07k865f2_cfLpwOXycQ8hNyuYik5Ldo_9pv-dcMJgzFV8XZMIB0iQTnF-RWQgbxhhXCy4lTMj7az8cOls3NsEDekvzzvqR5k3jbYNjO_S07WmBPgKhws7SZTfsa_pxDKPdhgeKdHU0vq1pvtv5AavPa3LpsAt29n-nZP30uF6ukuLt-WWZFwlKLhLhUGV6IUy0yMA5o7VLDWiV1oqriqNdZIpzbcCBkSkDp1gFgtlYpJSpYUpuz7On3nLn2y36Y_nXXZ66I3F3JqLX196GsdwMe99Hp5JrKbXMQAv4Ben6WY0</recordid><startdate>20240311</startdate><enddate>20240311</enddate><creator>Kuang, Jinxi</creator><creator>Liu, Jinyang</creator><creator>Huang, Junjie</creator><creator>Zhong, Renyi</creator><creator>Gu, Jiazhen</creator><creator>Yu, Lan</creator><creator>Tan, Rui</creator><creator>Yang, Zengyin</creator><creator>Lyu, Michael R</creator><general>Cornell University Library, 
arXiv.org</general><scope>8FE</scope><scope>8FG</scope><scope>ABJCF</scope><scope>ABUWG</scope><scope>AFKRA</scope><scope>AZQEC</scope><scope>BENPR</scope><scope>BGLVJ</scope><scope>CCPQU</scope><scope>DWQXO</scope><scope>HCIFZ</scope><scope>L6V</scope><scope>M7S</scope><scope>PIMPY</scope><scope>PQEST</scope><scope>PQQKQ</scope><scope>PQUKI</scope><scope>PRINS</scope><scope>PTHSS</scope><scope>AKY</scope><scope>GOX</scope></search><sort><creationdate>20240311</creationdate><title>Knowledge-aware Alert Aggregation in Large-scale Cloud Systems: a Hybrid Approach</title><author>Kuang, Jinxi ; Liu, Jinyang ; Huang, Junjie ; Zhong, Renyi ; Gu, Jiazhen ; Yu, Lan ; Tan, Rui ; Yang, Zengyin ; Lyu, Michael R</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-a524-4fa68974b55383ffb99f1b3961d626c2ae786229b3f3b5103f60c340e48566bd3</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2024</creationdate><topic>Cloud computing</topic><topic>Computer Science - Computation and Language</topic><topic>Computer Science - Learning</topic><topic>Computer Science - Software Engineering</topic><topic>Correlation</topic><topic>Harnesses</topic><topic>Hybrid systems</topic><topic>Large language models</topic><topic>Modules</topic><topic>Reasoning</topic><topic>Root cause analysis</topic><topic>Semantics</topic><topic>Similarity</topic><topic>Statistical methods</topic><toplevel>online_resources</toplevel><creatorcontrib>Kuang, Jinxi</creatorcontrib><creatorcontrib>Liu, Jinyang</creatorcontrib><creatorcontrib>Huang, Junjie</creatorcontrib><creatorcontrib>Zhong, Renyi</creatorcontrib><creatorcontrib>Gu, Jiazhen</creatorcontrib><creatorcontrib>Yu, Lan</creatorcontrib><creatorcontrib>Tan, Rui</creatorcontrib><creatorcontrib>Yang, Zengyin</creatorcontrib><creatorcontrib>Lyu, Michael R</creatorcontrib><collection>ProQuest SciTech Collection</collection><collection>ProQuest Technology 
Collection</collection><collection>Materials Science & Engineering Collection</collection><collection>ProQuest Central (Alumni Edition)</collection><collection>ProQuest Central UK/Ireland</collection><collection>ProQuest Central Essentials</collection><collection>ProQuest Central</collection><collection>Technology Collection</collection><collection>ProQuest One Community College</collection><collection>ProQuest Central Korea</collection><collection>SciTech Premium Collection</collection><collection>ProQuest Engineering Collection</collection><collection>Engineering Database</collection><collection>Publicly Available Content Database</collection><collection>ProQuest One Academic Eastern Edition (DO NOT USE)</collection><collection>ProQuest One Academic</collection><collection>ProQuest One Academic UKI Edition</collection><collection>ProQuest Central China</collection><collection>Engineering Collection</collection><collection>arXiv Computer Science</collection><collection>arXiv.org</collection><jtitle>arXiv.org</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Kuang, Jinxi</au><au>Liu, Jinyang</au><au>Huang, Junjie</au><au>Zhong, Renyi</au><au>Gu, Jiazhen</au><au>Yu, Lan</au><au>Tan, Rui</au><au>Yang, Zengyin</au><au>Lyu, Michael R</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>Knowledge-aware Alert Aggregation in Large-scale Cloud Systems: a Hybrid Approach</atitle><jtitle>arXiv.org</jtitle><date>2024-03-11</date><risdate>2024</risdate><eissn>2331-8422</eissn><abstract>Due to the scale and complexity of cloud systems, a system failure would trigger an "alert storm", i.e., massive correlated alerts. Although these alerts can be traced back to a few root causes, the overwhelming number makes it infeasible for manual handling. Alert aggregation is thus critical to help engineers concentrate on the root cause and facilitate failure resolution. 
Existing methods typically utilize semantic similarity-based methods or statistical methods to aggregate alerts. However, semantic similarity-based methods overlook the causal rationale of alerts, while statistical methods can hardly handle infrequent alerts. To tackle these limitations, we introduce leveraging external knowledge, i.e., Standard Operation Procedure (SOP) of alerts as a supplement. We propose COLA, a novel hybrid approach based on correlation mining and LLM (Large Language Model) reasoning for online alert aggregation. The correlation mining module effectively captures the temporal and spatial relations between alerts, measuring their correlations in an efficient manner. Subsequently, only uncertain pairs with low confidence are forwarded to the LLM reasoning module for detailed analysis. This hybrid design harnesses both statistical evidence for frequent alerts and the reasoning capabilities of computationally intensive LLMs, ensuring the overall efficiency of COLA in handling large volumes of alerts in practical scenarios. We evaluate COLA on three datasets collected from the production environment of a large-scale cloud platform. The experimental results show COLA achieves F1-scores from 0.901 to 0.930, outperforming state-of-the-art methods and achieving comparable efficiency. We also share our experience in deploying COLA in our real-world cloud system, Cloud X.</abstract><cop>Ithaca</cop><pub>Cornell University Library, arXiv.org</pub><doi>10.48550/arxiv.2403.06485</doi><oa>free_for_read</oa></addata></record> |
fulltext | fulltext |
identifier | EISSN: 2331-8422 |
ispartof | arXiv.org, 2024-03 |
issn | 2331-8422 |
language | eng |
recordid | cdi_arxiv_primary_2403_06485 |
source | arXiv.org; Free E- Journals |
subjects | Cloud computing Computer Science - Computation and Language Computer Science - Learning Computer Science - Software Engineering Correlation Harnesses Hybrid systems Large language models Modules Reasoning Root cause analysis Semantics Similarity Statistical methods |
title | Knowledge-aware Alert Aggregation in Large-scale Cloud Systems: a Hybrid Approach |
url | https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-15T17%3A25%3A17IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_arxiv&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Knowledge-aware%20Alert%20Aggregation%20in%20Large-scale%20Cloud%20Systems:%20a%20Hybrid%20Approach&rft.jtitle=arXiv.org&rft.au=Kuang,%20Jinxi&rft.date=2024-03-11&rft.eissn=2331-8422&rft_id=info:doi/10.48550/arxiv.2403.06485&rft_dat=%3Cproquest_arxiv%3E2955958394%3C/proquest_arxiv%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2955958394&rft_id=info:pmid/&rfr_iscdi=true |
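
The description field above says that correlation mining scores alert pairs cheaply and that only uncertain, low-confidence pairs are forwarded to the LLM reasoning module. The record does not give the scoring function, the thresholds, or the prompt, so the following is only a minimal sketch of that routing idea and not the authors' implementation. The names `AlertPair`, `Decision`, `route_pairs`, and `stub_reasoner`, the score range of 0 to 1, and the `low`/`high` cut-offs are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class AlertPair:
    """Two alert identifiers plus a mined correlation score (hypothetical schema)."""
    first: str
    second: str
    score: float  # assumed to lie in [0, 1]; higher means more strongly correlated


@dataclass(frozen=True)
class Decision:
    """Outcome for one pair and which stage produced it."""
    pair: AlertPair
    correlated: bool
    decided_by: str  # "statistics" or "reasoner"


def route_pairs(
    pairs: Iterable[AlertPair],
    reasoner: Callable[[AlertPair], bool],
    low: float = 0.3,   # assumed cut-off: at or below this, treat as unrelated
    high: float = 0.7,  # assumed cut-off: at or above this, treat as correlated
) -> List[Decision]:
    """Decide confident pairs from the score alone; defer the rest to `reasoner`."""
    if not 0.0 <= low < high <= 1.0:
        raise ValueError("expected 0 <= low < high <= 1")
    decisions: List[Decision] = []
    for pair in pairs:
        if pair.score >= high:
            decisions.append(Decision(pair, True, "statistics"))
        elif pair.score <= low:
            decisions.append(Decision(pair, False, "statistics"))
        else:
            # Uncertain band: only these pairs pay the cost of the reasoning call.
            decisions.append(Decision(pair, reasoner(pair), "reasoner"))
    return decisions


def stub_reasoner(pair: AlertPair) -> bool:
    """Placeholder for an expensive reasoning call; always answers True here."""
    return True


if __name__ == "__main__":
    demo = [
        AlertPair("disk_full", "io_latency", 0.92),
        AlertPair("cpu_high", "dns_timeout", 0.05),
        AlertPair("vm_down", "net_unreachable", 0.50),
    ]
    for d in route_pairs(demo, stub_reasoner):
        print(d.pair.first, d.pair.second, d.correlated, d.decided_by)
```

Passing the reasoner as a plain callable keeps the sketch independent of any particular model API; in a real system that callable would wrap whatever reasoning service is available, and the band between `low` and `high` controls how many pairs reach it.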