Advanced Technology Group
in the CSO Office

Fail-Slow at Scale: Evidence of Hardware Performance Faults in Large Production Systems

Haryadi S. Gunawi and Riza O. Suminto, University of Chicago; Russell Sears and Casey Golliher, Pure Storage; Swaminathan Sundararaman, Parallel Machines; Xing Lin and Tim Emami, NetApp; Weiguang Sheng and Nematollah Bidokhti, Huawei; Caitie McCaffrey, Twitter; Gary Grider and Parks M. Fields, Los Alamos National Laboratory; Kevin Harms and Robert B. Ross, Argonne National Laboratory; Andree Jacobson, New Mexico Consortium; Robert Ricci and Kirk Webb, University of Utah; Peter Alvaro, University of California, Santa Cruz; Mingzhe Hao, Huaicheng Li, and H. Birali Runesha, University of Chicago

Fail-slow hardware is an under-studied failure mode. We present a study of 114 reports of fail-slow hardware incidents, collected from large-scale cluster deployments in 14 institutions. We show that all hardware types, such as disk, SSD, CPU, memory, and network components, can exhibit performance faults. We make several important observations: faults can convert from one form to another, the chains of cascading root causes and impacts can be long, and fail-slow faults can have varying symptoms. From this study, we make suggestions to vendors, operators, and systems designers.

ChewAnalyzer: Workload-Aware Data Management Across Differentiated Storage Pools

Xiongzi Ge, NetApp, Inc.; Xuchao Xie, NUDT; David H.C. Du, University of Minnesota; Pradeep Ganesan, NetApp, Inc.; Dennis Hahn, NetApp, Inc.

The 26th IEEE International Symposium on the Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS 2018)


Eamonn Keogh, UC Riverside – August 2018

Time Series Snippets: A New Analytics Primitive with applications to IoT Edge Computing

While most of today’s always-connected tech devices take advantage of cloud computing, many Internet of Things (IoT) developers increasingly understand the benefits of doing more analytics on the devices themselves, a philosophy known as edge computing. By performing analytic tasks directly on the sensor, edge computing can drastically reduce the bandwidth, cloud processing, and cloud storage needed.

John Paparrizos, University of Chicago – August 2018

Accelerating Internet of Things Data Analytics through Scalable Time-Series Representation Learning

Kernel methods, a class of machine learning algorithms for pattern recognition, have shown a great deal of promise in the analysis of complex, real-world data. However, kernel methods remain largely unexplored in the analysis of time-varying measurements (i.e., time series), which are becoming increasingly prevalent across scientific disciplines, industrial settings, and Internet of Things (IoT) applications. Until now, research in time-series analysis has focused on designing methods for three components, namely, (i) representation methods; (ii) comparison functions; and (iii) indexing mechanisms. Unfortunately, these components have typically been investigated and developed independently, resulting in methods that are incompatible with each other. The lack of a unified approach has hindered progress towards scalable analytics over massive time-series collections.
