An Empirical Study on Configuration Errors in Commercial and Open Source Systems

Zuoning Yin, Xiao Ma, Jing Zheng, Yuanyuan Zhou, Lakshmi N. Bairavasundaram, and Shankar Pasupathy.

Configuration errors (i.e., misconfigurations) are among the dominant causes of system failures. Their importance has inspired many research efforts on detecting, diagnosing, and fixing misconfigurations; such research would benefit greatly from a real-world characteristic study on misconfigurations. Unfortunately, few such studies have been conducted in the past, primarily because historical misconfigurations usually have not been recorded rigorously in databases.

In this work, we undertake one of the first attempts to conduct a real-world misconfiguration characteristic study. We study a total of 546 real-world misconfigurations, including 309 misconfigurations from a commercial storage system deployed at thousands of customers, and 237 from four widely used open source systems (CentOS, MySQL, Apache HTTP Server, and OpenLDAP). Some of our major findings include: (1) A majority of misconfigurations (70.0%∼85.5%) are due to mistakes in setting configuration parameters; however, a significant number of misconfigurations are due to compatibility issues or component configurations (i.e., not parameter-related). (2) 38.1%∼53.7% of parameter mistakes are caused by illegal parameters that clearly violate some format or rules, motivating the use of an automatic configuration checker to detect these misconfigurations. (3) A significant percentage (12.2%∼29.7%) of parameter-based mistakes are due to inconsistencies between different parameter values. (4) 21.7%∼57.3% of the misconfigurations involve configurations external to the examined system, some even on entirely different hosts. (5) A significant portion of misconfigurations can cause hard-to-diagnose failures, such as crashes, hangs, or severe performance degradation, indicating that systems should be better-equipped to handle misconfigurations.
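Findings (2) and (3) suggest what a rule-based configuration checker might look for: values that violate a format or range rule, and values that are inconsistent with one another. The following is a minimal sketch of that idea, not the tooling used or proposed in the paper. The parameter names (`port`, `min_connections`, `max_connections`), the rules, and the helper functions are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Config = Dict[str, str]  # parameter name -> raw string value


@dataclass
class Rule:
    name: str
    check: Callable[[Config], bool]  # returns True when the config passes
    message: str


def int_in_range(key: str, low: int, high: int) -> Callable[[Config], bool]:
    """Format/range rule: the value must parse as an integer in [low, high]."""
    def check(cfg: Config) -> bool:
        if key not in cfg:
            return True  # absent parameters are not checked in this sketch
        try:
            return low <= int(cfg[key]) <= high
        except ValueError:
            return False  # not an integer at all
    return check


def at_most(smaller_key: str, larger_key: str) -> Callable[[Config], bool]:
    """Consistency rule: one parameter must not exceed another."""
    def check(cfg: Config) -> bool:
        try:
            return int(cfg[smaller_key]) <= int(cfg[larger_key])
        except (KeyError, ValueError):
            return True  # missing or malformed values are left to format rules
    return check


# Hypothetical rule set; a real checker would derive rules per system.
RULES: List[Rule] = [
    Rule("port-format", int_in_range("port", 1, 65535),
         "port must be an integer in 1..65535"),
    Rule("pool-consistency", at_most("min_connections", "max_connections"),
         "min_connections must not exceed max_connections"),
]


def check_config(cfg: Config) -> List[str]:
    """Return one message per violated rule, in rule order."""
    return [f"{rule.name}: {rule.message}" for rule in RULES if not rule.check(cfg)]
```

Calling `check_config({"port": "80800", "min_connections": "50", "max_connections": "10"})` reports both rules as violated, while a configuration that satisfies them yields an empty list. A checker of this kind only covers illegal and inconsistent parameter values; the compatibility and cross-host issues in findings (1) and (4) would need information from outside the examined configuration.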

In Proceedings of the ACM Symposium on Operating Systems Principles 2011 (SOSP’11)

Resources

  • The author’s version of the paper is attached to this posting. Please observe the following copyright notice:

© ACM, 2011. This is the author’s version of the work. It is posted here by permission of ACM for your personal use. Not for redistribution. The definitive version was published in the Proceedings of the ACM Symposium on Operating Systems Principles 2011 (SOSP ’11), https://dx.doi.org/10.1145/2043556.2043572

sosp11-yin.pdf

Bianca Schroeder, University of Toronto – July 2010

A Unified Framework for Managing Storage System Reliability

As the loss of data can have devastating consequences for many businesses, one of the most important aspects in designing enterprise storage systems is reliability. Designing and configuring storage systems that minimize the chance of data loss requires a good understanding of how different design or configuration choices and future technological trends affect system reliability.

Unfortunately, the details of many key aspects of storage system reliability are not well understood. These include, for example, the effect of environmental parameters such as temperature, the effect of workload type and intensity, the degree of correlation between different failures or error events, and the reliability characteristics of new media such as solid state drives. As a result, much existing work on storage system reliability either ignores many of those factors or relies on simplistic assumptions that do not reflect the real world.

The research work takes a three-pronged approach to overcoming many of the above problems: (1) We plan to collect and analyze field data from production storage systems that will allow us to study those aspects of storage system reliability that are particularly poorly understood; (2) We will use the results of our data analysis to derive more realistic models and simulation environments for evaluating storage reliability; (3) We will use our models to answer some frequently asked questions about storage system reliability and make projections for future challenges in building reliable storage systems.

We expect that the tools and insights derived from our work will be useful both to practitioners involved in configuring and running large-scale storage systems and to the designers of next-generation storage systems.