CloudRCA: A Root Cause Analysis Framework for Cloud Computing Platforms

30th ACM International Conference on Information and Knowledge Management (CIKM 2021). As Alibaba's business expands worldwide across various industries, higher standards are imposed on the service quality and reliability of the big data cloud computing platforms that constitute the infrastructur...

Full description

Saved in:
Bibliographic details
Main authors: Zhang, Yingying; Guan, Zhengxiong; Qian, Huajie; Xu, Leili; Liu, Hengbo; Wen, Qingsong; Sun, Liang; Jiang, Junwei; Fan, Lunting; Ke, Min
Format: Article
Language: eng
Subjects:
Online access: Order full text
creator Zhang, Yingying
Guan, Zhengxiong
Qian, Huajie
Xu, Leili
Liu, Hengbo
Wen, Qingsong
Sun, Liang
Jiang, Junwei
Fan, Lunting
Ke, Min
description 30th ACM International Conference on Information and Knowledge Management (CIKM 2021). As Alibaba's business expands worldwide across various industries, higher standards are imposed on the service quality and reliability of the big data cloud computing platforms that constitute the infrastructure of Alibaba Cloud. However, root cause analysis on these platforms is non-trivial because of the complicated system architecture. In this paper, we propose a root cause analysis framework called CloudRCA, which uses heterogeneous multi-source data, including Key Performance Indicators (KPIs), logs, and topology, and extracts important features via state-of-the-art anomaly detection and log analysis techniques. The engineered features are then used in a Knowledge-informed Hierarchical Bayesian Network (KHBN) model to infer root causes with high accuracy and efficiency. An ablation study and comprehensive experimental comparisons demonstrate that, compared to existing frameworks, CloudRCA 1) consistently outperforms existing approaches in F1-score across different cloud systems; 2) can handle novel types of root causes thanks to the hierarchical structure of KHBN; 3) performs more robustly with respect to algorithmic configurations; and 4) scales more favorably with data and feature sizes. Experiments also show that a cross-platform transfer learning mechanism can be adopted to further improve accuracy by more than 10%. CloudRCA has been integrated into the diagnosis system of Alibaba Cloud and deployed on three typical cloud computing platforms: MaxCompute, Realtime Compute, and Hologres. Over the past twelve months, it has saved Site Reliability Engineers (SREs) more than 20% of the time spent on resolving failures and has significantly improved service reliability.
doi_str_mv 10.48550/arxiv.2111.03753
format Article
fulltext fulltext_linktorsrc
identifier DOI: 10.48550/arxiv.2111.03753
language eng
recordid cdi_arxiv_primary_2111_03753
source arXiv.org
subjects Computer Science - Distributed, Parallel, and Cluster Computing
Computer Science - Learning
Computer Science - Software Engineering
title CloudRCA: A Root Cause Analysis Framework for Cloud Computing Platforms
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-02-06T00%3A09%3A22IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-arxiv_GOX&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=CloudRCA:%20A%20Root%20Cause%20Analysis%20Framework%20for%20Cloud%20Computing%20Platforms&rft.au=Zhang,%20Yingying&rft.date=2021-11-05&rft_id=info:doi/10.48550/arxiv.2111.03753&rft_dat=%3Carxiv_GOX%3E2111_03753%3C/arxiv_GOX%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_id=info:pmid/&rfr_iscdi=true
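The description field says that features engineered from KPIs and logs are passed to a Bayesian network that infers the root cause. As a rough illustration of that general idea only, the sketch below ranks candidate root causes with a flat naive Bayes model over binary anomaly indicators. It is not the paper's KHBN: the model has no hierarchy and no topology, and every cause name, feature name, and probability is a hypothetical value chosen for the example.

```python
from math import prod

# Hypothetical prior over root-cause types (illustrative numbers, not from the paper).
PRIOR = {"disk_full": 0.3, "network_jitter": 0.5, "cpu_saturation": 0.2}

# Hypothetical P(feature is anomalous | root cause). The features stand in for
# KPI-anomaly and log-pattern indicators produced by an upstream detection step.
LIKELIHOOD = {
    "disk_full":      {"kpi_disk_usage": 0.9, "kpi_latency": 0.4, "log_io_error": 0.8},
    "network_jitter": {"kpi_disk_usage": 0.1, "kpi_latency": 0.9, "log_io_error": 0.1},
    "cpu_saturation": {"kpi_disk_usage": 0.1, "kpi_latency": 0.7, "log_io_error": 0.05},
}


def posterior(observed):
    """Return P(root cause | observed binary features) via Bayes' rule.

    Assumes the features are conditionally independent given the cause,
    which is the naive Bayes simplification, not the KHBN structure.
    """
    scores = {}
    for cause, prior in PRIOR.items():
        terms = [
            LIKELIHOOD[cause][name] if anomalous else 1.0 - LIKELIHOOD[cause][name]
            for name, anomalous in observed.items()
        ]
        scores[cause] = prior * prod(terms)
    total = sum(scores.values())
    return {cause: score / total for cause, score in scores.items()}


def rank_root_causes(observed):
    """Sort candidate root causes from most to least probable."""
    post = posterior(observed)
    return sorted(post.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # One incident: disk usage and I/O error logs are anomalous, latency is not.
    observation = {"kpi_disk_usage": True, "kpi_latency": False, "log_io_error": True}
    for cause, prob in rank_root_causes(observation):
        print(f"{cause}: {prob:.3f}")
```

A flat model like this cannot represent the unseen root-cause types or cross-platform transfer that the description attributes to the hierarchical structure of KHBN; it only shows how anomaly indicators can be turned into a ranked list of causes.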