Tag Archives: fast

Parity Lost and Parity Regained

A. Krioukov, L.N. Bairavasundaram, G.R. Goodson, K. Srinivasan, R. Thelen, A.C. Arpaci-Dusseau, and R.H. Arpaci-Dusseau.

This paper uses model checking to evaluate mechanisms used in parity-based RAID systems to protect against a single sector error or a block corruption, and identifies additional protection measures necessary to handle all errors.

RAID storage systems protect data from storage errors, such as data corruption, using a set of one or more integrity techniques, such as checksums. The exact protection offered by certain techniques or a combination of techniques is sometimes unclear. We introduce and apply a formal method of analyzing the design of data protection strategies. Specifically, we use model checking to evaluate whether common protection techniques used in parity-based RAID systems are sufficient in light of the increasingly complex failure modes of modern disk drives. We evaluate the approaches taken by a number of real systems under single-error conditions, and find flaws in every scheme. In particular, we identify a parity pollution problem that spreads corrupt data (the result of a single error) across multiple disks, thus leading to data loss or corruption. We further identify which protection measures must be used to avoid such problems. Finally, we show how to combine real-world failure data with the results from the model checker to estimate the actual likelihood of data loss of different protection strategies.
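To make the parity pollution problem concrete, the toy Python sketch below (not the paper's model checker, and with invented block values) shows how a single silent corruption can be folded into freshly computed parity, after which reconstruction faithfully returns the corrupt data and the original is unrecoverable.

```python
# Toy illustration (not the paper's model checker) of "parity pollution"
# in a 4+1 XOR-parity stripe. All names and values here are hypothetical.

def xor_blocks(blocks):
    """XOR a list of equal-length byte blocks together."""
    out = bytearray(len(blocks[0]))
    for b in blocks:
        for i, byte in enumerate(b):
            out[i] ^= byte
    return bytes(out)

# Original stripe: four data blocks and their parity.
data = [bytes([d] * 8) for d in (1, 2, 3, 4)]
parity = xor_blocks(data)
good_copy = list(data)                      # what the blocks *should* contain

# A silent corruption hits disk 2; no protection technique catches it.
data[2] = bytes([0xFF] * 8)

# A small write to disk 0 recomputes parity additively, i.e. by reading
# the *other* data blocks -- including the corrupt one -- and XORing them
# with the new data. The corruption is now folded into parity.
data[0] = bytes([9] * 8)
parity = xor_blocks([data[0], data[1], data[2], data[3]])

# Later, disk 2 is reconstructed from parity. Because parity was polluted,
# reconstruction returns the corrupt value, and the original data is lost.
rebuilt = xor_blocks([data[0], data[1], data[3], parity])
assert rebuilt == data[2]                   # parity "agrees" with the corruption
assert rebuilt != good_copy[2]              # ...but the real data is gone
```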

In Proceedings of the USENIX Conference on File and Storage Technologies 2008 (FAST ’08)

Resources

  • A copy of the paper is attached to this posting.

parity-fast08.pdf

Pergamum: Replacing Tape with Energy Efficient, Reliable, Disk-Based Archival Storage

Mark W. Storer, Kevin M. Greenan, Ethan L. Miller, and Kaladhar Voruganti.

Pergamum is a distributed network of intelligent, disk-based storage appliances that stores data reliably and energy-efficiently.

As the world moves to digital storage for archival purposes, there is an increasing demand for reliable, low-power, cost-effective, easy-to-maintain storage that can still provide adequate performance for information retrieval and auditing purposes. Unfortunately, no current archival system adequately fulfills all of these requirements. Tape-based archival systems suffer from poor random access performance, which prevents the use of inter-media redundancy techniques and auditing, and requires the preservation of legacy hardware. Many disk-based systems are ill-suited for long-term storage because their high energy demands and management requirements make them cost-ineffective for archival purposes.

Our solution, Pergamum, is a distributed network of intelligent, disk-based, storage appliances that stores data reliably and energy-efficiently. While existing MAID systems keep disks idle to save energy, Pergamum adds NVRAM at each node to store data signatures, metadata, and other small items, allowing deferred writes, metadata requests and inter-disk data verification to be performed while the disk is powered off. Pergamum uses both intra-disk and inter-disk redundancy to guard against data loss, relying on hash tree-like structures of algebraic signatures to efficiently verify the correctness of stored data. If failures occur, Pergamum uses staggered rebuild to reduce peak energy usage while rebuilding large redundancy stripes. We show that our approach is comparable in both startup and ongoing costs to other archival technologies and provides very high reliability. An evaluation of our implementation of Pergamum shows that it provides adequate performance.
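The inter-disk verification relies on the fact that such signatures commute with linear redundancy codes: for XOR parity, the signature of the parity block equals the XOR of the data-block signatures, so small signatures kept in NVRAM can be cross-checked while the disks stay powered off. The sketch below illustrates that property with a toy linear signature (word-wise XOR) standing in for the Galois-field algebraic signatures Pergamum actually uses; block contents and sizes are invented.

```python
# Toy illustration of signature-based inter-disk checking: for a linear
# signature and XOR parity, sig(parity) == XOR of the data-block signatures.
# The "signature" below is a stand-in; Pergamum uses algebraic signatures
# over a Galois field, not this toy word-XOR.

def toy_signature(block, word=4):
    """XOR the block together word-by-word -- a simple linear 'signature'."""
    sig = bytearray(word)
    for i in range(0, len(block), word):
        for j, byte in enumerate(block[i:i + word]):
            sig[j] ^= byte
    return bytes(sig)

def xor_blocks(blocks):
    out = bytearray(len(blocks[0]))
    for b in blocks:
        for i, byte in enumerate(b):
            out[i] ^= byte
    return bytes(out)

data = [bytes([d] * 16) for d in (5, 6, 7)]
parity = xor_blocks(data)

# Signatures are tiny and can live in NVRAM on each node.
sigs = [toy_signature(b) for b in data]
parity_sig = toy_signature(parity)

# Consistency check without reading (or spinning up) the data disks:
assert parity_sig == xor_blocks(sigs)
```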

In Proceedings of the USENIX Conference on File and Storage Technologies 2008 (FAST ’08)

Resources

  • A copy of the paper is attached to this posting.

storer2008pergamum.pdf

Are Disks the Dominant Contributor for Storage Failures? A Comprehensive Study of Storage Subsystem Failure Characteristics

Weihang Jiang, Chongfeng Hu, Yuanyuan Zhou, and Arkady Kanevsky.

This paper analyzes the failure characteristics of storage subsystems, using storage logs from 39,000 commercial storage systems.

Building reliable storage systems becomes increasingly challenging as the complexity of modern storage systems continues to grow. Understanding storage failure characteristics is crucially important for designing and building a reliable storage system. While several recent studies have been conducted on understanding storage failures, almost all of them focus on the failure characteristics of one component – disks – and do not study other storage component failures.

This paper analyzes the failure characteristics of storage subsystems. More specifically, we analyzed the storage logs collected from about 39,000 storage systems commercially deployed at various customer sites. The data set covers a period of 44 months and includes about 1,800,000 disks hosted in about 155,000 storage shelf enclosures. Our study reveals many interesting findings, providing useful guidelines for designing reliable storage systems. Some of our major findings include: (1) In addition to disk failures that contribute to 20-55% of storage subsystem failures, other components such as physical interconnects and protocol stacks also account for significant percentages of storage subsystem failures. (2) Each individual storage subsystem failure type and storage subsystem failure as a whole exhibit strong self-correlations. In addition, these failures exhibit “bursty” patterns. (3) Storage subsystems configured with redundant interconnects experience 30-40% lower failure rates than those with a single interconnect. (4) Spanning the disks of a RAID group across multiple shelves provides a more resilient solution for storage subsystems than placing them within a single shelf.

In Proceedings of the USENIX Conference on File and Storage Technologies 2008 (FAST ’08)

Resources

  • A copy of the paper is attached to this posting.

dominant-fast08.pdf

SWEEPER: An Efficient Disaster Recovery Point Identification Mechanism

Akshat Verma, Kaladhar Voruganti, Ramani Routray, and Rohit Jain.

This paper presents a technique for automatically identifying recovery points: based on system events and user-specified RTO/RPO requirements, it determines which backup copy to restore from.

Data corruption is one of the key problems that is on top of the radar screen of most CIOs. Continuous Data Protection (CDP) technologies help enterprises deal with data corruption by maintaining multiple versions of data and facilitating recovery by allowing an administrator to restore to an earlier clean version of data. The aim of the recovery process after data corruption is to quickly traverse through the backup copies (old versions), and retrieve a clean copy of data. Currently, data recovery is an ad-hoc, time consuming and frustrating process with sequential brute force approaches, where recovery time is proportional to the number of backup copies examined and the time to check a backup copy for data corruption.

In this paper, we present the design and implementation of SWEEPER architecture and backup copy selection algorithms that specifically tackle the problem of quickly and systematically identifying a good recovery point. We monitor various system events and generate checkpoint records that help in quickly identifying a clean backup copy. The SWEEPER methodology dynamically determines the selection algorithm based on user specified recovery time and recovery point objectives, and thus, allows system administrators to perform trade-offs between recovery time and data currentness. We have implemented our solution as part of a popular Storage Resource Manager product and evaluated SWEEPER under many diverse settings. Our study clearly establishes the effectiveness of SWEEPER as a robust strategy to significantly reduce recovery time.
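As a rough illustration of event-guided recovery point selection (not SWEEPER's actual algorithms), the sketch below ranks backup copies by a suspicion score derived from checkpointed events and verifies candidates within a recovery-time budget, returning the most recent copy that checks out clean; the check function, scores, and costs are all hypothetical.

```python
# Illustrative sketch of event-guided recovery point selection: probe the
# copies most likely to be clean first, stay within the recovery-time
# budget, and prefer the newest clean copy for data currentness.
# check_clean() and the suspicion scores are hypothetical stand-ins.

from dataclasses import dataclass

@dataclass
class Backup:
    timestamp: float          # when the copy was taken
    suspicion: float          # score derived from checkpointed system events

def pick_recovery_point(backups, check_clean, check_cost, time_budget):
    """Return the newest backup verified clean within the time budget."""
    candidates = sorted(backups, key=lambda b: (b.suspicion, -b.timestamp))
    best = None
    spent = 0.0
    for b in candidates:
        if spent + check_cost > time_budget:          # honor the RTO
            break
        spent += check_cost
        if check_clean(b) and (best is None or b.timestamp > best.timestamp):
            best = b
    return best

# Example: copies taken every hour; corruption crept in after t = 3.
backups = [Backup(t, suspicion=0.9 if t > 3 else 0.1) for t in range(6)]
clean = lambda b: b.timestamp <= 3
print(pick_recovery_point(backups, clean, check_cost=1.0, time_budget=4.0))
```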

In Proceedings of the USENIX Conference on File and Storage Technologies 2008 (FAST ’08)

Resources

  • A copy of the paper is attached to this posting.

Sweeper-fast08.pdf

AWOL: An Adaptive Write Optimizations Layer

Alexandros Batsakis, Randal Burns, Arkady Kanevsky, James Lentini, and Thomas Talpey.

This paper presents I/O performance improvements from adaptively allocating memory between write buffering and read caching and opportunistically writing dirty pages.

Operating system memory managers fail to consider the population of read versus write pages in the buffer pool or outstanding I/O requests when writing dirty pages to disk or network file systems. This leads to bursty I/O patterns, which stall processes reading data and reduce the efficiency of storage. We address these limitations by adaptively allocating memory between write buffering and read caching and by writing dirty pages to disk opportunistically before the operating system submits them for write-back. We implement and evaluate our methods within the Linux® system and show performance gains of more than 30% for mixed read/write workloads.
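The sketch below is a loose illustration of the two ideas in the abstract, with invented thresholds and names rather than AWOL's actual heuristics: shift memory between read caching and write buffering according to observed pressure, and flush dirty pages opportunistically while the disk is idle rather than waiting for a hard write-back limit.

```python
# Rough sketch of the *idea* only (not AWOL's actual heuristics).
# All thresholds and names below are illustrative.

class AdaptiveWriteLayer:
    def __init__(self, total_pages):
        self.total_pages = total_pages
        self.write_share = 0.5          # fraction of memory for dirty pages
        self.dirty_pages = 0

    def rebalance(self, read_hit_rate, write_arrival_rate):
        """Nudge the read/write split toward whichever side is under pressure."""
        if read_hit_rate < 0.8 and write_arrival_rate < 0.5:
            self.write_share = max(0.2, self.write_share - 0.05)
        elif write_arrival_rate > 0.5:
            self.write_share = min(0.8, self.write_share + 0.05)

    def maybe_flush(self, disk_idle, flush_fn):
        """Opportunistically write back dirty pages while the disk is idle,
        well before the hard limit forces a bursty, blocking flush."""
        soft_limit = 0.5 * self.write_share * self.total_pages
        if disk_idle and self.dirty_pages > soft_limit:
            flushed = flush_fn(self.dirty_pages // 2)   # partial, cheap flush
            self.dirty_pages -= flushed
```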

In Proceedings of the USENIX Conference on File and Storage Technologies 2008 (FAST ’08)

Resources

  • A copy of the paper is attached to this posting.

awol_fast08.pdf

An Analysis of Data Corruption in the Storage Stack

L.N. Bairavasundaram, G.R. Goodson, B. Schroeder, A.C. Arpaci-Dusseau, and R.H. Arpaci-Dusseau.

This paper analyzes real-world data on the prevalence of data corruption due to storage stack components such as disk drives, and analyzes its characteristics such as dependence on disk-drive type, spatial and temporal locality, and correlation with workload.

An important threat to reliable storage of data is silent data corruption. In order to develop suitable protection mechanisms against data corruption, it is essential to understand its characteristics. In this paper, we present the first large-scale study of data corruption. We analyze corruption instances recorded in production storage systems containing a total of 1.53 million disk drives, over a period of 41 months. We study three classes of corruption: checksum mismatches, identity discrepancies, and parity inconsistencies. We focus on checksum mismatches since they occur the most.

We find more than 400,000 instances of checksum mismatches over the 41-month period. We find many interesting trends among these instances including: i) nearline disks (and their adapters) develop checksum mismatches an order of magnitude more often than enterprise class disk drives, ii) checksum mismatches within the same disk are not independent events and they show high spatial and temporal locality, and iii) checksum mismatches across different disks in the same storage system are not independent. We use our observations to derive lessons for corruption-proof system design.

Best Student Paper Award

In Proceedings of the USENIX Conference on File and Storage Technologies 2008 (FAST ’08)

Resources

  • A copy of the paper is attached to this posting.

corruption-fast08.pdf

Data ONTAP GX: A Scalable Storage Cluster

Michael Eisler, Peter Corbett, Michael Kazar, Daniel S. Nydick, and J. Christopher Wagner.

This paper presents Data ONTAP GX, a clustered Network Attached File server that is composed of a number of cooperating filers.

Data ONTAP GX is a clustered Network Attached File server composed of a number of cooperating filers. Each filer manages its own local file system, which consists of a number of disconnected flexible volumes. A separate namespace infrastructure runs within the cluster, which connects the volumes into one or more namespaces by means of internal junctions. The cluster collectively exposes a potentially large number of separate virtual servers, each with its own independent namespace, security and administrative domain. The cluster implements a protocol routing and translation layer which translates requests in all incoming file protocols into a single unified internal file access protocol called SpinNP. The translated requests are then forwarded to the correct filer within the cluster for servicing by the local file system instance. This provides data location transparency, which is used to support transparent data migration, load balancing, mirroring for load sharing and data protection, and fault tolerance. The cluster itself greatly simplifies the administration of a large number of filers by consolidating them into a single system image. Results from benchmarks (over one million file operations per second on a 24 node cluster) and customer experience demonstrate linear scaling.
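The routing layer can be pictured as a translate-and-forward step driven by a replicated volume-location table; the sketch below is a hypothetical rendering of that structure (the message format, table, and function names are invented, and only the general flow comes from the abstract).

```python
# Hypothetical sketch of the translate-and-forward routing idea: the node
# that receives a client request translates it into an internal operation
# and forwards it to whichever node currently owns the target volume.

# Volume-location table, replicated across the cluster. Migrating a volume
# only updates this table, which is what makes data location transparent.
volume_owner = {"vol_home": "filer-3", "vol_proj": "filer-7"}

def translate(protocol, request):
    """Map an incoming NFS/CIFS-style request to one internal operation."""
    return {"op": request["op"], "volume": request["volume"],
            "path": request["path"], "client_protocol": protocol}

def route(local_node, protocol, request, send, serve_locally):
    internal_op = translate(protocol, request)
    owner = volume_owner[internal_op["volume"]]
    if owner == local_node:
        return serve_locally(internal_op)       # local file system instance
    return send(owner, internal_op)             # forward over the cluster fabric
```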

In Proceedings of the USENIX Conference on File and Storage Technologies 2007 (FAST ’07)

Resources

  • A copy of the paper is attached to this posting.

GX-fast2007.pdf

Row-Diagonal Parity for Double Disk Failure Correction

Peter Corbett, Bob English, Atul Goel, Tomislav Grcanac, Steven Kleiman, James Leong, and Sunitha Sankar.

This paper introduces Row-Diagonal Parity (RDP), a new algorithm for protecting against double disk failures.

Row-Diagonal Parity (RDP) is a new algorithm for protecting against double disk failures. It stores all data unencoded, and uses only exclusive-or operations to compute parity. RDP is provably optimal in computational complexity, both during construction and reconstruction. Like other algorithms, it is optimal in the amount of redundant information stored and accessed. RDP works within a single stripe of blocks of sizes normally used by file systems, databases and disk arrays. It can be utilized in a fixed (RAID-4) or rotated (RAID-5) parity placement style. It is possible to extend the algorithm to encompass multiple RAID-4 or RAID-5 disk arrays in a single RDP disk array. It is possible to add disks to an existing RDP array without recalculating parity or moving data. Implementation results show that RDP performance can be made nearly equal to single parity RAID-4 and RAID-5 performance.
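As a concrete illustration of the layout, the sketch below encodes one stripe with p = 5: four data disks, one row-parity disk, and one diagonal-parity disk, with four rows per stripe and one-byte blocks. It covers encoding only; double-failure reconstruction is omitted, and the block values are invented.

```python
# Toy sketch of RDP *encoding* with p = 5: four data disks, one row-parity
# disk, one diagonal-parity disk, and p-1 = 4 rows per stripe, using
# one-byte blocks. Only XOR is used, as the abstract describes.

p = 5                                    # prime stripe parameter
rows = p - 1
data = [[(10 * c + r + 1) & 0xFF for r in range(rows)]   # data[column][row]
        for c in range(p - 1)]

# Row parity disk: XOR across the data disks in each row.
row_parity = [0] * rows
for r in range(rows):
    for c in range(p - 1):
        row_parity[r] ^= data[c][r]

# Diagonal parity disk: the block at row r, column c lies on diagonal
# (r + c) mod p, where column p-1 is the row parity disk. Diagonals
# 0..p-2 are stored; diagonal p-1 is the "missing" diagonal.
diag_parity = [0] * rows
for r in range(rows):
    for c in range(p):                   # data columns plus the row parity column
        k = (r + c) % p
        if k == p - 1:
            continue                     # missing diagonal, not stored
        block = row_parity[r] if c == p - 1 else data[c][r]
        diag_parity[k] ^= block

# Each stored diagonal covers exactly p-1 blocks, so both parity disks are
# computed with XOR only and the data itself is stored unencoded.
print(row_parity, diag_parity)
```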

Best Paper Award

In Proceedings of the USENIX Conference on File and Storage Technologies 2004 (FAST ’04)

Resources

  • A copy of the paper is attached to this posting.

rdp-fast04.pdf

SnapMirror®: File System Based Asynchronous Mirroring for Disaster Recovery

Hugo Patterson, Stephen Manley, Mike Federwisch, Dave Hitz, Steve Kleiman, and Shane Owara.

We present SnapMirror, an asynchronous mirroring technology that leverages file system snapshots to ensure the consistency of the remote mirror and optimize data transfer.

Computerized data has become critical to the survival of an enterprise. Companies must have a strategy for recovering their data should a disaster such as a fire destroy the primary data center. Current mechanisms offer data managers a stark choice: rely on affordable tape but risk the loss of a full day of data and face many hours or even days to recover, or have the benefits of a fully synchronized on-line remote mirror, but pay steep costs in both write latency and network bandwidth to maintain the mirror. In this paper, we argue that asynchronous mirroring, in which batches of updates are periodically sent to the remote mirror, can let data managers find a balance between these extremes. First, by eliminating the write latency issue, asynchrony greatly reduces the performance cost of a remote mirror. Second, by storing up batches of writes, asynchronous mirroring can avoid sending deleted or overwritten data and thereby reduce network bandwidth requirements. Data managers can tune the update frequency to trade network bandwidth against the potential loss of more data. We present SnapMirror, an asynchronous mirroring technology that leverages file system snapshots to ensure the consistency of the remote mirror and optimize data transfer. We use traces of production filers to show that even updating an asynchronous mirror every 15 minutes can reduce data transferred by 30% to 80%. We find that exploiting file system knowledge of deletions is critical to achieving any reduction for no-overwrite file systems such as WAFL and LFS. Experiments on a running system show that using file system metadata can reduce the time to identify changed blocks from minutes to seconds compared to purely logical approaches. Finally, we show that using SnapMirror to update every 30 minutes increases the response time of a heavily loaded system only 22%.
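The transfer computation can be pictured as a set difference over snapshot block allocation maps: in a no-overwrite file system, modified data lands in newly allocated blocks, so blocks allocated only in the newer snapshot must be sent, and blocks allocated only in the older one can simply be freed on the mirror. The sketch below illustrates this with an invented, simplified map format.

```python
# Minimal sketch of the snapshot-to-snapshot transfer idea: compare the
# block allocation maps of the last mirrored snapshot and the new one.
# The map representation here is an invented simplification.

def snapshot_delta(old_map, new_map):
    """Given per-snapshot sets of allocated block numbers, return the
    blocks to transfer and the blocks the mirror may free."""
    to_send = new_map - old_map     # new or rewritten blocks
    to_free = old_map - new_map     # deleted/overwritten blocks: never sent
    return to_send, to_free

# Example: between updates, blocks 7 and 9 were rewritten (reallocated as
# 21 and 22) and block 3 was deleted.
old_snapshot = {1, 2, 3, 7, 9}
new_snapshot = {1, 2, 21, 22}
send, free = snapshot_delta(old_snapshot, new_snapshot)
print(sorted(send), sorted(free))   # [21, 22] [3, 7, 9]
```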

In Proceedings of the USENIX Conference on File and Storage Technologies 2002 (FAST ’02)

Resources

  • A copy of the paper is attached to this posting.

snapmirror-fast02.pdf