Fail-Slow at Scale: Evidence of Hardware Performance Faults in Large Production Systems

Haryadi S. Gunawi and Riza O. Suminto, University of Chicago; Russell Sears and Casey Golliher, Pure Storage; Swaminathan Sundararaman, Parallel Machines; Xing Lin and Tim Emami, NetApp; Weiguang Sheng and Nematollah Bidokhti, Huawei; Caitie McCaffrey, Twitter; Gary Grider and Parks M. Fields, Los Alamos National Laboratory; Kevin Harms and Robert B. Ross, Argonne National Laboratory; Andree Jacobson, New Mexico Consortium; Robert Ricci and Kirk Webb, University of Utah; Peter Alvaro, University of California, Santa Cruz; Mingzhe Hao, Huaicheng Li, and H. Birali Runesha, University of Chicago

Fail-slow hardware is an under-studied failure mode. We present a study of 114 reports of fail-slow hardware incidents, collected from large-scale cluster deployments in 14 institutions. We show that all major hardware types (disk, SSD, CPU, memory, and network components) can exhibit performance faults. We make several important observations: fail-slow faults can convert from one form to another, chains of cascading root causes and impacts can be long, and the same underlying fault can manifest with varying symptoms. From this study, we offer suggestions to vendors, operators, and systems designers.


ChewAnalyzer: Workload-Aware Data Management Across Differentiated Storage Pools

Xiongzi Ge, NetApp, Inc.; Xuchao Xie, NUDT; David H.C. Du, University of Minnesota; Pradeep Ganesan, NetApp, Inc.; Dennis Hahn, NetApp, Inc.

The 26th IEEE International Symposium on the Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS 2018)


More Publications

Eamonn Keogh, UC Riverside – August 2018

While most of today’s always-connected tech devices take advantage of cloud computing, many Internet of Things (IoT) developers increasingly understand the benefits of doing more analytics on the devices themselves, a philosophy known as edge computing. By performing analytic tasks directly on the sensor, edge computing can drastically reduce the bandwidth, cloud processing, and cloud storage needed.

However, even if taken to the extreme, edge computing will occasionally have to report some summary data to a central server. Thus, a common type of IoT analytical query is essentially “Send me some representative/typical data.” This query might be issued by a human attempting to understand an unexpected event at a manufacturing plant, or it might be issued by an algorithm as a subroutine in some higher-level analytics. In either case, the problem of finding representative time series subsequences has not been solved despite the ubiquity of time series in almost all human endeavors, and especially in IoT domains.
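The "send me representative data" query can be made concrete with a simple baseline (this is an illustrative sketch, not the project's method): treat every fixed-length subsequence as a candidate and return the medoid, i.e., the subsequence whose total z-normalized Euclidean distance to all others is smallest. Function names here are hypothetical.

```python
import numpy as np

def znorm(x):
    """Z-normalize a subsequence so comparisons ignore offset and scale."""
    s = x.std()
    return (x - x.mean()) / s if s > 0 else x - x.mean()

def most_representative_subsequence(ts, m):
    """Return the start index of the length-m subsequence with the
    smallest total distance to every other subsequence (the medoid).
    Brute force: O(n^2 * m), fine for an edge-device-sized series."""
    n = len(ts) - m + 1
    subs = np.array([znorm(ts[i:i + m]) for i in range(n)])
    best_i, best_cost = 0, np.inf
    for i in range(n):
        cost = np.linalg.norm(subs - subs[i], axis=1).sum()
        if cost < best_cost:
            best_i, best_cost = i, cost
    return best_i

# Example: a sensor trace of repeated cycles; the medoid is a "typical" cycle.
ts = np.sin(np.linspace(0, 8 * np.pi, 80))
idx = most_representative_subsequence(ts, 20)
```

A real deployment would replace the quadratic scan with an indexed or streaming approximation, but the medoid definition captures what "typical" means in this query.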


John Paparrizos, University of Chicago – August 2018

Kernel methods, a class of machine learning algorithms for pattern recognition, have shown a great deal of promise in the analysis of complex, real-world data. However, kernel methods remain largely unexplored in the analysis of time-varying measurements (i.e., time series), which are becoming increasingly prevalent across scientific disciplines, industrial settings, and Internet of Things (IoT) applications. Until now, research in time-series analysis has focused on designing methods for three components, namely, (i) representation methods; (ii) comparison functions; and (iii) indexing mechanisms. Unfortunately, these components have typically been investigated and developed independently, resulting in methods that are incompatible with each other. The lack of a unified approach has hindered progress towards scalable analytics over massive time-series collections.

We propose to address this major drawback by leveraging kernel methods to automatically learn time-series representations (i.e., learn to effectively compress time series). Such compact representations are compatible with common indexing mechanisms and, importantly, preserve the invariance to time-series distortions offered by user-defined comparison methods. Therefore, our approach enables computational methods to operate directly over the compressed time-series data, which significantly reduces their storage and computation requirements. We propose to evaluate the performance of our learned representations on five tasks of critical importance in time-series analysis, namely, indexing, classification, clustering, sampling, and visualization. Additionally, we have already established a partnership with a leading electric service supplier to develop a case study on predicting future energy demand using large-scale smart meter data. Finally, we plan to integrate our methods into Apache Spark, a prominent big data processing platform, to facilitate analytics over massive time-series collections.
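Why compressed representations cooperate with indexing can be illustrated with a classical (non-learned) baseline, Piecewise Aggregate Approximation (PAA): the distance computed in the compressed space lower-bounds the true Euclidean distance, so an index search over compressed data never falsely dismisses a true match. This sketch is a textbook baseline, not the proposal's learned representation; names are illustrative.

```python
import numpy as np

def paa(x, segments):
    """Piecewise Aggregate Approximation: compress a series of length n
    (n divisible by `segments`) down to its per-segment means."""
    n = len(x)
    assert n % segments == 0
    return x.reshape(segments, n // segments).mean(axis=1)

def paa_dist(a, b, n):
    """Distance in PAA space. By a per-segment Cauchy-Schwarz argument
    this never exceeds the Euclidean distance between the originals,
    which is the lower-bounding property index structures rely on."""
    seg_len = n // len(a)
    return np.sqrt(seg_len * np.sum((a - b) ** 2))

# Compress two length-64 series to 8 coefficients each (8x reduction)
rng = np.random.default_rng(0)
a, b = rng.normal(size=64), rng.normal(size=64)
pa, pb = paa(a, 8), paa(b, 8)
```

An index can prune any candidate whose `paa_dist` already exceeds the current best true distance, touching the full-resolution data only for survivors.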


More Fellowships