Haryadi Gunawi, University of Chicago – September 2013

Study of Limping Hardware Characteristics

The detection of “limping” hardware, that is, hardware whose performance deviates from its specification, is important for maintaining the performance, reliability, and availability of clustered Data ONTAP systems. The reports and anecdotes provided in the proposal recall a quote attributed to Leslie Lamport: “A distributed system is one in which the failure of a computer you didn’t even know existed can render your own computer unusable.” Clustered Data ONTAP is a distributed system; it employs many distributed-systems techniques, such as remote procedure calls (SpinNP), failure detectors, and consensus algorithms. The effects of a “limping” component on one controller can percolate to other controllers, rendering the entire system unusable. Even worse, such “limping” hardware goes undetected today, which adds a burden to the NetApp support team. Knowing how to detect such hardware and correct the problem dynamically will therefore be valuable to NetApp.