Knowledge-aware Alert Aggregation in Large-scale Cloud Systems: a Hybrid Approach

Due to the scale and complexity of cloud systems, a system failure would trigger an "alert storm", i.e., massive correlated alerts. Although these alerts can be traced back to a few root causes, the overwhelming number makes it infeasible for manual handling. Alert aggregation is thus crit...

Bibliographic details
Published in:	arXiv.org 2024-03
Main authors:	Kuang, Jinxi; Liu, Jinyang; Huang, Junjie; Zhong, Renyi; Gu, Jiazhen; Yu, Lan; Tan, Rui; Yang, Zengyin; Lyu, Michael R
Format:	Article
Language:	eng
Subjects:	Cloud computing; Computer Science - Computation and Language; Computer Science - Learning; Computer Science - Software Engineering; Correlation; Harnesses; Hybrid systems; Large language models; Modules; Reasoning; Root cause analysis; Semantics; Similarity; Statistical methods
Online access:	Full text

container_end_page
container_issue
container_start_page
container_title	arXiv.org
container_volume
creator	Kuang, Jinxi; Liu, Jinyang; Huang, Junjie; Zhong, Renyi; Gu, Jiazhen; Yu, Lan; Tan, Rui; Yang, Zengyin; Lyu, Michael R
description	Due to the scale and complexity of cloud systems, a system failure would trigger an "alert storm", i.e., massive correlated alerts. Although these alerts can be traced back to a few root causes, the overwhelming number makes it infeasible for manual handling. Alert aggregation is thus critical to help engineers concentrate on the root cause and facilitate failure resolution. Existing methods typically utilize semantic similarity-based methods or statistical methods to aggregate alerts. However, semantic similarity-based methods overlook the causal rationale of alerts, while statistical methods can hardly handle infrequent alerts. To tackle these limitations, we introduce leveraging external knowledge, i.e., Standard Operation Procedure (SOP) of alerts as a supplement. We propose COLA, a novel hybrid approach based on correlation mining and LLM (Large Language Model) reasoning for online alert aggregation. The correlation mining module effectively captures the temporal and spatial relations between alerts, measuring their correlations in an efficient manner. Subsequently, only uncertain pairs with low confidence are forwarded to the LLM reasoning module for detailed analysis. This hybrid design harnesses both statistical evidence for frequent alerts and the reasoning capabilities of computationally intensive LLMs, ensuring the overall efficiency of COLA in handling large volumes of alerts in practical scenarios. We evaluate COLA on three datasets collected from the production environment of a large-scale cloud platform. The experimental results show COLA achieves F1-scores from 0.901 to 0.930, outperforming state-of-the-art methods and achieving comparable efficiency. We also share our experience in deploying COLA in our real-world cloud system, Cloud X.
doi_str_mv	10.48550/arxiv.2403.06485
format	Article
fullrecord	<record><control><sourceid>proquest_arxiv</sourceid><recordid>TN_cdi_arxiv_primary_2403_06485</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><sourcerecordid>2955958394</sourcerecordid><originalsourceid>FETCH-LOGICAL-a524-4fa68974b55383ffb99f1b3961d626c2ae786229b3f3b5103f60c340e48566bd3</originalsourceid><addsrcrecordid>eNotj1FLwzAURoMgOOZ-gE8GfO5Mc5Os8a0MdWJBxL2XmzapHV07k865f2_cfLpwOXycQ8hNyuYik5Ldo_9pv-dcMJgzFV8XZMIB0iQTnF-RWQgbxhhXCy4lTMj7az8cOls3NsEDekvzzvqR5k3jbYNjO_S07WmBPgKhws7SZTfsa_pxDKPdhgeKdHU0vq1pvtv5AavPa3LpsAt29n-nZP30uF6ukuLt-WWZFwlKLhLhUGV6IUy0yMA5o7VLDWiV1oqriqNdZIpzbcCBkSkDp1gFgtlYpJSpYUpuz7On3nLn2y36Y_nXXZ66I3F3JqLX196GsdwMe99Hp5JrKbXMQAv4Ben6WY0</addsrcrecordid><sourcetype>Open Access Repository</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>2955958394</pqid></control><display><type>article</type><title>Knowledge-aware Alert Aggregation in Large-scale Cloud Systems: a Hybrid Approach</title><source>arXiv.org</source><source>Free E- Journals</source><creator>Kuang, Jinxi ; Liu, Jinyang ; Huang, Junjie ; Zhong, Renyi ; Gu, Jiazhen ; Yu, Lan ; Tan, Rui ; Yang, Zengyin ; Lyu, Michael R</creator><creatorcontrib>Kuang, Jinxi ; Liu, Jinyang ; Huang, Junjie ; Zhong, Renyi ; Gu, Jiazhen ; Yu, Lan ; Tan, Rui ; Yang, Zengyin ; Lyu, Michael R</creatorcontrib><description>Due to the scale and complexity of cloud systems, a system failure would trigger an "alert storm", i.e., massive correlated alerts. Although these alerts can be traced back to a few root causes, the overwhelming number makes it infeasible for manual handling. Alert aggregation is thus critical to help engineers concentrate on the root cause and facilitate failure resolution. Existing methods typically utilize semantic similarity-based methods or statistical methods to aggregate alerts. However, semantic similarity-based methods overlook the causal rationale of alerts, while statistical methods can hardly handle infrequent alerts. 
To tackle these limitations, we introduce leveraging external knowledge, i.e., Standard Operation Procedure (SOP) of alerts as a supplement. We propose COLA, a novel hybrid approach based on correlation mining and LLM (Large Language Model) reasoning for online alert aggregation. The correlation mining module effectively captures the temporal and spatial relations between alerts, measuring their correlations in an efficient manner. Subsequently, only uncertain pairs with low confidence are forwarded to the LLM reasoning module for detailed analysis. This hybrid design harnesses both statistical evidence for frequent alerts and the reasoning capabilities of computationally intensive LLMs, ensuring the overall efficiency of COLA in handling large volumes of alerts in practical scenarios. We evaluate COLA on three datasets collected from the production environment of a large-scale cloud platform. The experimental results show COLA achieves F1-scores from 0.901 to 0.930, outperforming state-of-the-art methods and achieving comparable efficiency. We also share our experience in deploying COLA in our real-world cloud system, Cloud X.</description><identifier>EISSN: 2331-8422</identifier><identifier>DOI: 10.48550/arxiv.2403.06485</identifier><language>eng</language><publisher>Ithaca: Cornell University Library, arXiv.org</publisher><subject>Cloud computing ; Computer Science - Computation and Language ; Computer Science - Learning ; Computer Science - Software Engineering ; Correlation ; Harnesses ; Hybrid systems ; Large language models ; Modules ; Reasoning ; Root cause analysis ; Semantics ; Similarity ; Statistical methods</subject><ispartof>arXiv.org, 2024-03</ispartof><rights>2024. This work is published under http://arxiv.org/licenses/nonexclusive-distrib/1.0/ (the “License”). 
Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.</rights><rights>http://arxiv.org/licenses/nonexclusive-distrib/1.0</rights><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><link.rule.ids>228,230,778,782,883,27912</link.rule.ids><backlink>$$Uhttps://doi.org/10.1145/3639477.3639745$$DView published paper (Access to full text may be restricted)$$Hfree_for_read</backlink><backlink>$$Uhttps://doi.org/10.48550/arXiv.2403.06485$$DView paper in arXiv$$Hfree_for_read</backlink></links><search><creatorcontrib>Kuang, Jinxi</creatorcontrib><creatorcontrib>Liu, Jinyang</creatorcontrib><creatorcontrib>Huang, Junjie</creatorcontrib><creatorcontrib>Zhong, Renyi</creatorcontrib><creatorcontrib>Gu, Jiazhen</creatorcontrib><creatorcontrib>Yu, Lan</creatorcontrib><creatorcontrib>Tan, Rui</creatorcontrib><creatorcontrib>Yang, Zengyin</creatorcontrib><creatorcontrib>Lyu, Michael R</creatorcontrib><title>Knowledge-aware Alert Aggregation in Large-scale Cloud Systems: a Hybrid Approach</title><title>arXiv.org</title><description>Due to the scale and complexity of cloud systems, a system failure would trigger an "alert storm", i.e., massive correlated alerts. Although these alerts can be traced back to a few root causes, the overwhelming number makes it infeasible for manual handling. Alert aggregation is thus critical to help engineers concentrate on the root cause and facilitate failure resolution. Existing methods typically utilize semantic similarity-based methods or statistical methods to aggregate alerts. However, semantic similarity-based methods overlook the causal rationale of alerts, while statistical methods can hardly handle infrequent alerts. 
To tackle these limitations, we introduce leveraging external knowledge, i.e., Standard Operation Procedure (SOP) of alerts as a supplement. We propose COLA, a novel hybrid approach based on correlation mining and LLM (Large Language Model) reasoning for online alert aggregation. The correlation mining module effectively captures the temporal and spatial relations between alerts, measuring their correlations in an efficient manner. Subsequently, only uncertain pairs with low confidence are forwarded to the LLM reasoning module for detailed analysis. This hybrid design harnesses both statistical evidence for frequent alerts and the reasoning capabilities of computationally intensive LLMs, ensuring the overall efficiency of COLA in handling large volumes of alerts in practical scenarios. We evaluate COLA on three datasets collected from the production environment of a large-scale cloud platform. The experimental results show COLA achieves F1-scores from 0.901 to 0.930, outperforming state-of-the-art methods and achieving comparable efficiency. 
We also share our experience in deploying COLA in our real-world cloud system, Cloud X.</description><subject>Cloud computing</subject><subject>Computer Science - Computation and Language</subject><subject>Computer Science - Learning</subject><subject>Computer Science - Software Engineering</subject><subject>Correlation</subject><subject>Harnesses</subject><subject>Hybrid systems</subject><subject>Large language models</subject><subject>Modules</subject><subject>Reasoning</subject><subject>Root cause analysis</subject><subject>Semantics</subject><subject>Similarity</subject><subject>Statistical methods</subject><issn>2331-8422</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2024</creationdate><recordtype>article</recordtype><sourceid>ABUWG</sourceid><sourceid>AFKRA</sourceid><sourceid>AZQEC</sourceid><sourceid>BENPR</sourceid><sourceid>CCPQU</sourceid><sourceid>DWQXO</sourceid><sourceid>GOX</sourceid><recordid>eNotj1FLwzAURoMgOOZ-gE8GfO5Mc5Os8a0MdWJBxL2XmzapHV07k865f2_cfLpwOXycQ8hNyuYik5Ldo_9pv-dcMJgzFV8XZMIB0iQTnF-RWQgbxhhXCy4lTMj7az8cOls3NsEDekvzzvqR5k3jbYNjO_S07WmBPgKhws7SZTfsa_pxDKPdhgeKdHU0vq1pvtv5AavPa3LpsAt29n-nZP30uF6ukuLt-WWZFwlKLhLhUGV6IUy0yMA5o7VLDWiV1oqriqNdZIpzbcCBkSkDp1gFgtlYpJSpYUpuz7On3nLn2y36Y_nXXZ66I3F3JqLX196GsdwMe99Hp5JrKbXMQAv4Ben6WY0</recordid><startdate>20240311</startdate><enddate>20240311</enddate><creator>Kuang, Jinxi</creator><creator>Liu, Jinyang</creator><creator>Huang, Junjie</creator><creator>Zhong, Renyi</creator><creator>Gu, Jiazhen</creator><creator>Yu, Lan</creator><creator>Tan, Rui</creator><creator>Yang, Zengyin</creator><creator>Lyu, Michael R</creator><general>Cornell University Library, 
arXiv.org</general><scope>8FE</scope><scope>8FG</scope><scope>ABJCF</scope><scope>ABUWG</scope><scope>AFKRA</scope><scope>AZQEC</scope><scope>BENPR</scope><scope>BGLVJ</scope><scope>CCPQU</scope><scope>DWQXO</scope><scope>HCIFZ</scope><scope>L6V</scope><scope>M7S</scope><scope>PIMPY</scope><scope>PQEST</scope><scope>PQQKQ</scope><scope>PQUKI</scope><scope>PRINS</scope><scope>PTHSS</scope><scope>AKY</scope><scope>GOX</scope></search><sort><creationdate>20240311</creationdate><title>Knowledge-aware Alert Aggregation in Large-scale Cloud Systems: a Hybrid Approach</title><author>Kuang, Jinxi ; Liu, Jinyang ; Huang, Junjie ; Zhong, Renyi ; Gu, Jiazhen ; Yu, Lan ; Tan, Rui ; Yang, Zengyin ; Lyu, Michael R</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-a524-4fa68974b55383ffb99f1b3961d626c2ae786229b3f3b5103f60c340e48566bd3</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2024</creationdate><topic>Cloud computing</topic><topic>Computer Science - Computation and Language</topic><topic>Computer Science - Learning</topic><topic>Computer Science - Software Engineering</topic><topic>Correlation</topic><topic>Harnesses</topic><topic>Hybrid systems</topic><topic>Large language models</topic><topic>Modules</topic><topic>Reasoning</topic><topic>Root cause analysis</topic><topic>Semantics</topic><topic>Similarity</topic><topic>Statistical methods</topic><toplevel>online_resources</toplevel><creatorcontrib>Kuang, Jinxi</creatorcontrib><creatorcontrib>Liu, Jinyang</creatorcontrib><creatorcontrib>Huang, Junjie</creatorcontrib><creatorcontrib>Zhong, Renyi</creatorcontrib><creatorcontrib>Gu, Jiazhen</creatorcontrib><creatorcontrib>Yu, Lan</creatorcontrib><creatorcontrib>Tan, Rui</creatorcontrib><creatorcontrib>Yang, Zengyin</creatorcontrib><creatorcontrib>Lyu, Michael R</creatorcontrib><collection>ProQuest SciTech Collection</collection><collection>ProQuest Technology 
Collection</collection><collection>Materials Science &amp; Engineering Collection</collection><collection>ProQuest Central (Alumni Edition)</collection><collection>ProQuest Central UK/Ireland</collection><collection>ProQuest Central Essentials</collection><collection>ProQuest Central</collection><collection>Technology Collection</collection><collection>ProQuest One Community College</collection><collection>ProQuest Central Korea</collection><collection>SciTech Premium Collection</collection><collection>ProQuest Engineering Collection</collection><collection>Engineering Database</collection><collection>Publicly Available Content Database</collection><collection>ProQuest One Academic Eastern Edition (DO NOT USE)</collection><collection>ProQuest One Academic</collection><collection>ProQuest One Academic UKI Edition</collection><collection>ProQuest Central China</collection><collection>Engineering Collection</collection><collection>arXiv Computer Science</collection><collection>arXiv.org</collection><jtitle>arXiv.org</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Kuang, Jinxi</au><au>Liu, Jinyang</au><au>Huang, Junjie</au><au>Zhong, Renyi</au><au>Gu, Jiazhen</au><au>Yu, Lan</au><au>Tan, Rui</au><au>Yang, Zengyin</au><au>Lyu, Michael R</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>Knowledge-aware Alert Aggregation in Large-scale Cloud Systems: a Hybrid Approach</atitle><jtitle>arXiv.org</jtitle><date>2024-03-11</date><risdate>2024</risdate><eissn>2331-8422</eissn><abstract>Due to the scale and complexity of cloud systems, a system failure would trigger an "alert storm", i.e., massive correlated alerts. Although these alerts can be traced back to a few root causes, the overwhelming number makes it infeasible for manual handling. Alert aggregation is thus critical to help engineers concentrate on the root cause and facilitate failure resolution. 
Existing methods typically utilize semantic similarity-based methods or statistical methods to aggregate alerts. However, semantic similarity-based methods overlook the causal rationale of alerts, while statistical methods can hardly handle infrequent alerts. To tackle these limitations, we introduce leveraging external knowledge, i.e., Standard Operation Procedure (SOP) of alerts as a supplement. We propose COLA, a novel hybrid approach based on correlation mining and LLM (Large Language Model) reasoning for online alert aggregation. The correlation mining module effectively captures the temporal and spatial relations between alerts, measuring their correlations in an efficient manner. Subsequently, only uncertain pairs with low confidence are forwarded to the LLM reasoning module for detailed analysis. This hybrid design harnesses both statistical evidence for frequent alerts and the reasoning capabilities of computationally intensive LLMs, ensuring the overall efficiency of COLA in handling large volumes of alerts in practical scenarios. We evaluate COLA on three datasets collected from the production environment of a large-scale cloud platform. The experimental results show COLA achieves F1-scores from 0.901 to 0.930, outperforming state-of-the-art methods and achieving comparable efficiency. We also share our experience in deploying COLA in our real-world cloud system, Cloud X.</abstract><cop>Ithaca</cop><pub>Cornell University Library, arXiv.org</pub><doi>10.48550/arxiv.2403.06485</doi><oa>free_for_read</oa></addata></record>
fulltext	fulltext
identifier	EISSN: 2331-8422
ispartof	arXiv.org, 2024-03
issn	2331-8422
language	eng
recordid	cdi_arxiv_primary_2403_06485
source	arXiv.org; Free E- Journals
subjects	Cloud computing; Computer Science - Computation and Language; Computer Science - Learning; Computer Science - Software Engineering; Correlation; Harnesses; Hybrid systems; Large language models; Modules; Reasoning; Root cause analysis; Semantics; Similarity; Statistical methods
title	Knowledge-aware Alert Aggregation in Large-scale Cloud Systems: a Hybrid Approach
url	https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-15T17%3A25%3A17IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_arxiv&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Knowledge-aware%20Alert%20Aggregation%20in%20Large-scale%20Cloud%20Systems:%20a%20Hybrid%20Approach&rft.jtitle=arXiv.org&rft.au=Kuang,%20Jinxi&rft.date=2024-03-11&rft.eissn=2331-8422&rft_id=info:doi/10.48550/arxiv.2403.06485&rft_dat=%3Cproquest_arxiv%3E2955958394%3C/proquest_arxiv%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2955958394&rft_id=info:pmid/&rfr_iscdi=true