An Analysis of Data Corruption in the Storage Stack

L.N. Bairavasundaram, G.R. Goodson, B. Schroeder, A.C. Arpaci-Dusseau, and R.H. Arpaci-Dusseau.

This paper analyzes real-world data on the prevalence of data corruption caused by storage stack components such as disk drives, and characterizes that corruption: its dependence on disk-drive type, its spatial and temporal locality, and its correlation with workload.

An important threat to reliable storage of data is silent data corruption. To develop suitable protection mechanisms against data corruption, it is essential to understand its characteristics. In this paper, we present the first large-scale study of data corruption. We analyze corruption instances recorded in production storage systems containing a total of 1.53 million disk drives, over a period of 41 months. We study three classes of corruption: checksum mismatches, identity discrepancies, and parity inconsistencies. We focus on checksum mismatches because they are the most common.

We find more than 400,000 instances of checksum mismatches over the 41-month period. We find many interesting trends among these instances, including: i) nearline disks (and their adapters) develop checksum mismatches an order of magnitude more often than enterprise-class disk drives, ii) checksum mismatches within the same disk are not independent events and show high spatial and temporal locality, and iii) checksum mismatches across different disks in the same storage system are not independent. We use our observations to derive lessons for corruption-proof system design.

Best Student Paper Award

In Proceedings of the USENIX Conference on File and Storage Technologies 2008 (FAST ’08)

Resources

corruption-fast08.pdf