CloudRCA: A Root Cause Analysis Framework for Cloud Computing Platforms
30th ACM International Conference on Information and Knowledge Management (CIKM 2021). As Alibaba's business expands worldwide across various industries, higher standards are imposed on the service quality and reliability of the big data cloud computing platforms which constitute the infrastructur...
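Main authors: | Zhang, Yingying; Guan, Zhengxiong; Qian, Huajie; Xu, Leili; Liu, Hengbo; Wen, Qingsong; Sun, Liang; Jiang, Junwei; Fan, Lunting; Ke, Min |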
---|---|
Format: | Article |
Language: | English |
container_end_page | |
---|---|
creator | Zhang, Yingying; Guan, Zhengxiong; Qian, Huajie; Xu, Leili; Liu, Hengbo; Wen, Qingsong; Sun, Liang; Jiang, Junwei; Fan, Lunting; Ke, Min |
description | 30th ACM International Conference on Information and Knowledge
Management (CIKM 2021). As Alibaba's business expands worldwide across various industries,
higher standards are imposed on the service quality and reliability of the big data
cloud computing platforms that constitute the infrastructure of Alibaba Cloud.
However, root cause analysis in these platforms is non-trivial due to the
complicated system architecture. In this paper, we propose a root cause
analysis framework called CloudRCA, which makes use of heterogeneous
multi-source data, including Key Performance Indicators (KPIs), logs, and
topology, and extracts important features via state-of-the-art anomaly
detection and log analysis techniques. The engineered features are then
used in a Knowledge-informed Hierarchical Bayesian Network (KHBN) model to
infer root causes with high accuracy and efficiency. An ablation study and
comprehensive experimental comparisons demonstrate that, compared to existing
frameworks, CloudRCA 1) consistently outperforms existing approaches in
F1-score across different cloud systems; 2) can handle novel types of root
causes thanks to the hierarchical structure of KHBN; 3) performs more robustly
with respect to algorithmic configurations; and 4) scales more favorably with
data and feature sizes. Experiments also show that a cross-platform transfer
learning mechanism can be adopted to further improve the accuracy by more than
10%. CloudRCA has been integrated into the diagnosis system of Alibaba Cloud
and deployed on three typical cloud computing platforms: MaxCompute,
Realtime Compute, and Hologres. Over the past twelve months, it has saved Site
Reliability Engineers (SREs) more than 20% of the time spent on resolving failures
and has improved service reliability significantly. |
doi_str_mv | 10.48550/arxiv.2111.03753 |
format | Article |
fulltext | fulltext_linktorsrc |
identifier | DOI: 10.48550/arxiv.2111.03753 |
language | eng |
recordid | cdi_arxiv_primary_2111_03753 |
source | arXiv.org |
subjects | Computer Science - Distributed, Parallel, and Cluster Computing; Computer Science - Learning; Computer Science - Software Engineering |
title | CloudRCA: A Root Cause Analysis Framework for Cloud Computing Platforms |
url | https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-02-06T00%3A09%3A22IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-arxiv_GOX&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=CloudRCA:%20A%20Root%20Cause%20Analysis%20Framework%20for%20Cloud%20Computing%20Platforms&rft.au=Zhang,%20Yingying&rft.date=2021-11-05&rft_id=info:doi/10.48550/arxiv.2111.03753&rft_dat=%3Carxiv_GOX%3E2111_03753%3C/arxiv_GOX%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_id=info:pmid/&rfr_iscdi=true |
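
The description above outlines a pipeline in which features extracted from KPIs and logs feed a Bayesian model that infers the root cause. The following is a minimal sketch of that general idea only, not the authors' KHBN: it swaps in a plain z-score test for the anomaly detector and a naive Bayes classifier (the simplest Bayesian network, with the root cause as the sole parent of every feature) for the hierarchical model. All feature names, root-cause labels, incident records, and the `threshold` and `alpha` parameters are hypothetical and chosen for illustration.

```python
import math
from collections import defaultdict


def kpi_anomaly_flag(series, threshold=3.0):
    """Return 1 if the last point lies more than `threshold` standard
    deviations from the mean of the earlier points (z-score test), else 0.
    This is a stand-in detector, not the one used in the paper."""
    history, latest = series[:-1], series[-1]
    mean = sum(history) / len(history)
    std = math.sqrt(sum((x - mean) ** 2 for x in history) / len(history))
    if std == 0:
        return int(latest != mean)
    return int(abs(latest - mean) / std > threshold)


class NaiveBayesRCA:
    """Naive Bayes over binary features; an assumed simplification of a
    Bayesian-network root cause model."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha  # Laplace smoothing strength (assumption)
        self.cause_counts = defaultdict(int)
        self.feature_counts = defaultdict(lambda: defaultdict(int))
        self.features = set()

    def fit(self, incidents):
        """`incidents` is a list of (feature_dict, root_cause_label) pairs."""
        for features, cause in incidents:
            self.cause_counts[cause] += 1
            for name, value in features.items():
                self.features.add(name)
                if value:
                    self.feature_counts[cause][name] += 1
        return self

    def posterior(self, features):
        """Return P(cause | features) for every cause seen during fit."""
        total = sum(self.cause_counts.values())
        log_scores = {}
        for cause, n in self.cause_counts.items():
            score = math.log(n / total)  # log prior
            for name in self.features:
                # Smoothed probability that the feature fires under this cause.
                p_on = (self.feature_counts[cause][name] + self.alpha) / (
                    n + 2 * self.alpha
                )
                score += math.log(p_on if features.get(name, 0) else 1.0 - p_on)
            log_scores[cause] = score
        # Normalize in a numerically stable way.
        peak = max(log_scores.values())
        weights = {c: math.exp(s - peak) for c, s in log_scores.items()}
        norm = sum(weights.values())
        return {c: w / norm for c, w in weights.items()}


# Hypothetical labeled incidents: binary KPI/log features and a root cause.
training = [
    ({"cpu_anomaly": 1, "disk_anomaly": 0, "oom_log": 0}, "cpu_saturation"),
    ({"cpu_anomaly": 1, "disk_anomaly": 0, "oom_log": 0}, "cpu_saturation"),
    ({"cpu_anomaly": 0, "disk_anomaly": 1, "oom_log": 0}, "disk_full"),
    ({"cpu_anomaly": 0, "disk_anomaly": 1, "oom_log": 0}, "disk_full"),
    ({"cpu_anomaly": 0, "disk_anomaly": 0, "oom_log": 1}, "memory_leak"),
    ({"cpu_anomaly": 1, "disk_anomaly": 0, "oom_log": 1}, "memory_leak"),
]
model = NaiveBayesRCA().fit(training)

# A new incident: the CPU KPI spikes, nothing else fires.
cpu_series = [50, 52, 49, 51, 50, 95]
observed = {
    "cpu_anomaly": kpi_anomaly_flag(cpu_series),
    "disk_anomaly": 0,
    "oom_log": 0,
}
ranking = sorted(model.posterior(observed).items(), key=lambda kv: kv[1], reverse=True)
for cause, prob in ranking:
    print(f"{cause}: {prob:.3f}")
```

A flat model like this cannot represent the layered structure or the expert knowledge that the description attributes to KHBN; it is meant only to show how anomaly flags and log-derived features could be combined into a ranked list of candidate root causes.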