Are disks the dominant contributor for storage failures?: A comprehensive study of storage subsystem failure characteristics

Building reliable storage systems becomes increasingly challenging as the complexity of modern storage systems continues to grow. Understanding storage failure characteristics is crucially important for designing and building a reliable storage system. While several recent studies have been conducte...

Bibliographic details
Published in: ACM transactions on storage 2008-11, Vol.4 (3), p.1-25
Main authors: Jiang, Weihang; Hu, Chongfeng; Zhou, Yuanyuan; Kanevsky, Arkady
Format: Article
Language: English
Online access: Full text
container_end_page 25
container_issue 3
container_start_page 1
container_title ACM transactions on storage
container_volume 4
creator Jiang, Weihang
Hu, Chongfeng
Zhou, Yuanyuan
Kanevsky, Arkady
description Building reliable storage systems becomes increasingly challenging as the complexity of modern storage systems continues to grow. Understanding storage failure characteristics is crucially important for designing and building a reliable storage system. While several recent studies have been conducted on understanding storage failures, almost all of them focus on the failure characteristics of one component (disks) and do not study other storage component failures. This article analyzes the failure characteristics of storage subsystems. More specifically, we analyzed the storage logs collected from about 39,000 storage systems commercially deployed at various customer sites. The dataset covers a period of 44 months and includes about 1,800,000 disks hosted in about 155,000 storage-shelf enclosures. Our study reveals many interesting findings, providing useful guidelines for designing reliable storage systems. Some of our major findings include: (1) In addition to disk failures that contribute to 20-55% of storage subsystem failures, other components such as physical interconnects and protocol stacks also account for a significant percentage of storage subsystem failures. (2) Each individual storage subsystem failure type, and storage subsystem failure as a whole, exhibits strong self-correlations. In addition, these failures exhibit "bursty" patterns. (3) Storage subsystems configured with redundant interconnects experience 30-40% lower failure rates than those with a single interconnect. (4) Spanning disks of a RAID group across multiple shelves provides a more resilient solution for storage subsystems than within a single shelf.
doi_str_mv 10.1145/1416944.1416946
format Article
fulltext fulltext
identifier ISSN: 1553-3077
ispartof ACM transactions on storage, 2008-11, Vol.4 (3), p.1-25
issn 1553-3077
1553-3093
language eng
recordid cdi_proquest_miscellaneous_34674316
source ACM Digital Library Complete
title Are disks the dominant contributor for storage failures?: A comprehensive study of storage subsystem failure characteristics
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-02-03T20%3A05%3A13IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_cross&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Are%20disks%20the%20dominant%20contributor%20for%20storage%20failures?:%20A%20comprehensive%20study%20of%20storage%20subsystem%20failure%20characteristics&rft.jtitle=ACM%20transactions%20on%20storage&rft.au=Jiang,%20Weihang&rft.date=2008-11&rft.volume=4&rft.issue=3&rft.spage=1&rft.epage=25&rft.pages=1-25&rft.issn=1553-3077&rft.eissn=1553-3093&rft_id=info:doi/10.1145/1416944.1416946&rft_dat=%3Cproquest_cross%3E34674316%3C/proquest_cross%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=34674316&rft_id=info:pmid/&rfr_iscdi=true