All posts by xing

Fail-Slow at Scale: Evidence of Hardware Performance Faults in Large Production Systems

Haryadi S. Gunawi and Riza O. Suminto, University of Chicago; Russell Sears and Casey Golliher, Pure Storage; Swaminathan Sundararaman, Parallel Machines; Xing Lin and Tim Emami, NetApp; Weiguang Sheng and Nematollah Bidokhti, Huawei; Caitie McCaffrey, Twitter; Gary Grider and Parks M. Fields, Los Alamos National Laboratory; Kevin Harms and Robert B. Ross, Argonne National Laboratory; Andree Jacobson, New Mexico Consortium; Robert Ricci and Kirk Webb, University of Utah; Peter Alvaro, University of California, Santa Cruz; Mingzhe Hao, Huaicheng Li, and H. Birali Runesha, University of Chicago

ACM Transactions on Storage (TOS)
Volume 14 Issue 3, October 2018
Article No. 23

Fail-slow hardware is an under-studied failure mode. We present a study of 114 reports of fail-slow hardware incidents, collected from large-scale cluster deployments in 14 institutions. We show that all hardware types, including disk, SSD, CPU, memory, and network components, can exhibit performance faults. We make several important observations: faults can convert from one form to another, cascading root causes and impacts can be long, and fail-slow faults can have varying symptoms. From this study, we make suggestions to vendors, operators, and systems designers.

ChewAnalyzer: Workload-Aware Data Management Across Differentiated Storage Pools

Xiongzi Ge, NetApp, Inc.; Xuchao Xie, NUDT; David H.C. Du, University of Minnesota; Pradeep Ganesan, NetApp, Inc.; Dennis Hahn, NetApp, Inc.

The 26th IEEE International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS 2018)

September 25 – 28, 2018
Milwaukee, Wisconsin, USA

In multi-tier storage systems, moving data from one tier to the next can be inefficient, and because each type of storage device has its own idiosyncrasies with respect to the workloads it can best support, unnecessary data movement may result. In this paper, we explore a fully connected storage architecture in which data can move from any storage pool to any other. We propose a Chunk-level storage-aware workload Analyzer framework, abbreviated as ChewAnalyzer, to facilitate efficient data placement. Access patterns are characterized flexibly as a collection of I/O accesses to a data chunk. ChewAnalyzer employs a Hierarchical Classifier [30] to analyze the chunk patterns step by step. In each classification step, the Chunk Placement Recommender suggests new data placement policies according to the device properties. Based on the analysis of access-pattern changes, the Storage Manager can distribute or migrate the data chunks across the different storage pools accordingly. Our experimental results show that ChewAnalyzer improves the initial data placement and migrates data into the proper pools directly and efficiently.
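
The abstract does not spell out the classifier's rules, so the following is only a minimal sketch of the idea under assumed parameters: classify each chunk's recent I/O activity step by step (here first by write ratio, then by hotness) and map each resulting class to a storage pool. The thresholds, class labels, and pool names are illustrative assumptions, not ChewAnalyzer's actual policy.

    # Minimal sketch of hierarchical, chunk-level access-pattern classification.
    # Thresholds, class labels, and pool names are illustrative assumptions only.

    def classify_chunk(reads, writes, recent_accesses):
        """Step 1: split write-dominant from read-dominant chunks;
        step 2: split hot from cold chunks within each branch."""
        total = reads + writes
        if total == 0:
            return "idle"
        write_ratio = writes / total
        hot = recent_accesses > 100          # assumed hotness threshold
        if write_ratio > 0.5:
            return "hot-write" if hot else "cold-write"
        return "hot-read" if hot else "cold-read"

    # Hypothetical placement policy: map each pattern class to a pool.
    PLACEMENT = {
        "hot-write":  "nvme-pool",
        "hot-read":   "ssd-read-pool",
        "cold-write": "hdd-pool",
        "cold-read":  "hdd-pool",
        "idle":       "capacity-pool",
    }

    def recommend_pool(chunk_stats):
        return PLACEMENT[classify_chunk(**chunk_stats)]

    print(recommend_pool({"reads": 900, "writes": 50, "recent_accesses": 400}))
    # -> ssd-read-pool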

A model-based approach to streamlining distributed training for asynchronous SGD

Sung-Han Lin, NetApp; Marco Paolieri, University of Southern California; Cheng-Fu Chou, National Taiwan University; Leana Golubchik, University of Southern California

The 26th IEEE International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS 2018)

September 25 – 28, 2018
Milwaukee, Wisconsin, USA

The success of Deep Neural Networks (DNNs) has created significant interest in the development of software tools, hardware architectures, and cloud systems to meet the huge computational demand of their training jobs. A common approach to speeding up an individual job is to distribute training data and computation among multiple nodes, periodically exchanging intermediate results. In this paper, we address two important problems for the application of this strategy to large-scale clusters and multiple, heterogeneous jobs. First, we propose and validate a queueing model to estimate the throughput of a training job as a function of the number of nodes assigned to the job; this model targets asynchronous Stochastic Gradient Descent (SGD), a popular strategy for distributed training, and requires only data from quick, two-node profiling in addition to job characteristics (number of requested training epochs, mini-batch size, size of DNN parameters, assigned bandwidth). Throughput estimations are then used to explore several classes of scheduling heuristics to reduce response time in a scenario where heterogeneous jobs are continuously submitted to a large-scale cluster. These scheduling algorithms dynamically select which jobs to run and how many nodes to assign to each job, based on different trade-offs between service time reduction and efficiency (e.g., speedup per additional node). Heuristics are evaluated through extensive simulations of realistic DNN workloads, also investigating the effects of early termination, a common scenario for DNN training jobs.
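
The queueing model itself is not reproduced in the abstract, so the snippet below is only a back-of-the-envelope throughput estimate under simplifying assumptions: each worker computes a mini-batch gradient in a fixed time (as could be measured in a two-node profiling run), then exchanges parameters of a given size over the assigned bandwidth, with no overlap of computation and communication, and aggregate throughput is capped by what the shared link can carry. All names and numbers are hypothetical, not the paper's validated model.

    # Rough asynchronous-SGD throughput estimate (mini-batches per second)
    # under strong simplifying assumptions; NOT the paper's queueing model.

    def iteration_time(t_compute, param_bytes, bandwidth_bytes_per_s):
        """One worker iteration: compute a gradient, then push gradients and
        pull updated parameters (counted as two transfers of the model size)."""
        return t_compute + 2 * param_bytes / bandwidth_bytes_per_s

    def estimated_throughput(num_nodes, t_compute, param_bytes, bandwidth_bytes_per_s):
        """Asynchronous workers do not wait for each other, so throughput grows
        roughly linearly with nodes until the assigned link saturates."""
        per_worker = 1.0 / iteration_time(t_compute, param_bytes, bandwidth_bytes_per_s)
        link_cap = bandwidth_bytes_per_s / (2 * param_bytes)   # updates/s the link can carry
        return min(num_nodes * per_worker, link_cap)

    # Example: 0.2 s of compute per mini-batch, 100 MB model, 10 Gb/s assigned bandwidth.
    for n in (1, 4, 16, 64):
        print(n, round(estimated_throughput(n, 0.2, 100e6, 10e9 / 8), 2))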

Efficient Search for Free Blocks in the WAFL File System

Ram Kesavan, Matthew Curtis-Maury, and Mrinal Bhattacharjee (NetApp)

47th International Conference on Parallel Processing (ICPP 2018)
August 13–16, 2018
Eugene, Oregon, USA

The WAFL write allocator is responsible for assigning blocks on persistent storage to data in a way that maximizes both write throughput to the storage media and subsequent read performance of data. The ability to quickly and efficiently guide the write allocator toward desirable regions of available free space is critical to achieving that goal. This ability is influenced by several factors, such as any underlying RAID geometry, media-specific attributes such as erase-block size of solid state drives or zone size of shingled magnetic hard drives, and free space fragmentation. This paper presents and evaluates the techniques used by the WAFL write allocator to efficiently find regions of free space.
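
The abstract does not describe the underlying data structures, so the toy sketch below only illustrates the core problem it names: scanning an allocation bitmap for regions of free blocks that are aligned to, and at least as large as, a device-specific region size (for example an SSD erase block or an SMR zone). The region size and bitmap layout are assumptions for illustration, not WAFL internals.

    # Toy search for aligned, fully free regions in an allocation bitmap.
    # Region size and bitmap layout are illustrative, not WAFL's structures.

    def free_regions(bitmap, region_blocks):
        """Yield the starting block of every aligned region of size
        region_blocks in which no block is allocated (bitmap[i] == True
        means block i is allocated)."""
        for start in range(0, len(bitmap) - region_blocks + 1, region_blocks):
            if not any(bitmap[start:start + region_blocks]):
                yield start

    # 32-block volume with an 8-block "erase block"; blocks 8-15 and 24-31 are free.
    bitmap = [True] * 8 + [False] * 8 + [True] * 8 + [False] * 8
    print(list(free_regions(bitmap, 8)))   # -> [8, 24]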

A Secure Cloud with Minimal Provider Trust

Amin Mosayyebzadeh and Gerardo Ravago, Boston University; Apoorve Mohan, Northeastern University; Ali Raza and Sahil Tikale, Boston University; Nabil Schear, MIT Lincoln Laboratory; Trammell Hudson, Two Sigma; Jason Hennessey, Boston University and NetApp; Naved Ansari, Boston University; Kyle Hogan, MIT; Charles Munson, MIT Lincoln Laboratory; Larry Rudolph, Two Sigma; Gene Cooperman and Peter Desnoyers, Northeastern University; Orran Krieger, Boston University

10th USENIX Workshop on Hot Topics in Cloud Computing (HotCloud ’18)
JULY 9, 2018
BOSTON, MA, USA

Bolted is a new architecture for a bare metal cloud with the goal of providing security-sensitive customers of a cloud the same level of security and control that they can obtain in their own private data centers. It allows tenants to elastically allocate secure resources within a cloud while being protected from other previous, current, and future tenants of the cloud. Provisioning a new server to a tenant isolates the bare metal server, only allowing it to communicate with the tenant’s other servers once its critical firmware and software have been attested to the tenant. Tenants, rather than the provider, control the tradeoffs between security, price, and performance. A prototype demonstrates scalable end-to-end security with small overhead compared to a less secure alternative.
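
The abstract only outlines the provisioning flow, so the sketch below is a hypothetical rendering of that sequence: keep a newly allocated bare-metal node isolated, compare measurements of its firmware and software against what the tenant expects, and attach it to the tenant's network only if they match. The function names and measurement format are invented for illustration and are not Bolted's interfaces.

    # Hypothetical provisioning flow in the spirit of the Bolted description:
    # isolate, attest firmware/software to the tenant, then admit to its network.
    import hashlib

    def measure(component_blobs):
        """Hash each firmware/software component (stand-in for TPM-style measurements)."""
        return {name: hashlib.sha256(blob).hexdigest()
                for name, blob in component_blobs.items()}

    def provision(node_blobs, tenant_expected, attach_to_tenant_network, keep_isolated):
        """Admit the node only if every measured component matches the tenant's policy."""
        if measure(node_blobs) == tenant_expected:
            attach_to_tenant_network()
            return True
        keep_isolated()        # attestation failed: the node never joins the tenant
        return False

    # Example with made-up firmware images.
    blobs = {"bios": b"vendor-bios-v2", "bootloader": b"signed-loader"}
    expected = measure(blobs)  # the tenant's trusted reference values
    provision(blobs, expected,
              attach_to_tenant_network=lambda: print("node admitted"),
              keep_isolated=lambda: print("node quarantined"))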

Empirical Evaluation and Enhancement of Enterprise Storage System Request Scheduling

Deng Zhou, San Diego State University; Vania Fang, NetApp; Tao Xie and Wen Pan, San Diego State University; Ram Kesavan, Tony Lin, and Naresh Patel, NetApp

ACM Transactions on Storage (TOS), Volume 14, Issue 2, May 2018, Article No. 14

Since little has been reported in the literature concerning enterprise storage system file-level request scheduling, we do not have enough knowledge about how various scheduling factors affect performance. Moreover, we lack a good understanding of how to enhance request scheduling to adapt to the changing characteristics of workloads and hardware resources. To answer these questions, we first build a request scheduler prototype based on WAFL®, a mainstream file system running on numerous enterprise storage systems worldwide. Next, we use the prototype to quantitatively measure the impact of various scheduling configurations on the performance of a NetApp® enterprise-class storage system. We make several observations; for example, we discover that to improve performance, the priority of write requests and non-preempted restarted requests should be boosted in some workloads. Inspired by these observations, we further propose two scheduling enhancement heuristics called SORD (size-oriented request dispatching) and QATS (queue-depth aware time slicing). Finally, we evaluate them by conducting a wide range of experiments using workloads generated by SPC-1 and SFS2014 on both HDD-based and all-flash platforms. Experimental results show that the combination of the two can noticeably reduce average request latency under some workloads.
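
The abstract names the two heuristics but not their exact rules, so the sketch below only illustrates the general ideas under assumed parameters: a size-oriented dispatcher that keeps small requests from queuing behind large ones, and a time-slicing policy whose slice length adapts to queue depth (here, one plausible choice: longer slices for deeper queues, up to a cap). The size cutoff and slice lengths are invented, not the values used in the paper.

    # Illustrative sketches of size-oriented dispatching (SORD-like) and
    # queue-depth-aware time slicing (QATS-like); all constants are assumptions.
    from collections import deque

    SMALL_REQUEST_BYTES = 64 * 1024        # assumed small/large cutoff

    small_q, large_q = deque(), deque()

    def dispatch(request):
        """Route requests by size so small requests are not stuck behind large ones."""
        q = small_q if request["bytes"] <= SMALL_REQUEST_BYTES else large_q
        q.append(request)

    def time_slice(queue, base_ms=4.0, max_ms=16.0):
        """One plausible depth-aware policy: let a deeper queue run longer so it
        can drain, capped to keep the other queue's latency bounded."""
        return min(max_ms, base_ms * (1 + len(queue) // 8))

    for size in (4096, 1 << 20, 8192):
        dispatch({"bytes": size})
    print(len(small_q), len(large_q), time_slice(small_q), time_slice(large_q))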

PASTE: A Network Programming Interface for Non-Volatile Main Memory

Michio Honda, NEC Laboratories Europe; Giuseppe Lettieri, Università di Pisa; Lars Eggert and Douglas Santry, NetApp

15th USENIX Symposium on Networked Systems Design and Implementation (NSDI ’18)
APRIL 9–11, 2018
RENTON, WA, USA

Non-Volatile Main Memory (NVMM) devices have been integrated into general-purpose operating systems through familiar file-based interfaces, providing efficient byte-granularity access by bypassing page caches. To leverage the unique advantages of these high-performance media, the storage stack is migrating from the kernel into user-space. However, application performance remains fundamentally limited unless network stacks explicitly integrate these new storage media and follow the migration of storage stacks into user-space. Moreover, we argue that the storage and the network stacks must be considered together when being designed for NVMM. This requires a thoroughly new network stack design, including low-level buffer management and APIs.

We propose PASTE, a new network programming interface for NVMM. It supports familiar abstractions—including busy-polling, blocking, protection, and run-to-completion—with standard network protocols such as TCP and UDP. By operating directly on NVMM, it can be closely integrated with the persistence layer of applications. Once data is DMA’ed from a network interface card to host memory (NVMM), it never needs to be copied again—even for persistence. We demonstrate the general applicability of PASTE by implementing two popular persistent data structures: a write-ahead log and a B+ tree. We further apply PASTE to three applications: Redis, a popular persistent key-value store; pKVS, our HTTP-based key-value store; and the logging component of a software switch, demonstrating that PASTE not only accelerates networked storage but also enables conventional networking functions to support new features.
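
The abstract gives the zero-copy idea only at a high level, so the snippet below merely illustrates its shape in Python terms, using an mmap'ed file as a stand-in for NVMM: data that already sits in the persistent region is made durable in place, and the write-ahead log records just an offset and length instead of copying the payload. The file names and log format are invented; the mmap flush and fsync calls stand in for the cache-line flushes and fences real NVMM persistence would use.

    # "Persist in place, log a reference" illustration; an mmap'ed file stands in
    # for NVMM, and all names and formats here are invented for the sketch.
    import mmap, os, struct

    REGION = "nvmm_region.bin"             # stand-in for an NVMM-backed buffer pool
    with open(REGION, "wb") as f:
        f.truncate(1 << 20)
    fd = os.open(REGION, os.O_RDWR)
    buf = mmap.mmap(fd, 1 << 20)

    log = open("wal.bin", "ab")            # write-ahead log holds only references

    def on_receive(offset, payload):
        """Pretend the NIC DMA'ed `payload` into the persistent region at `offset`:
        make it durable in place, then append only (offset, length) to the log."""
        buf[offset:offset + len(payload)] = payload
        buf.flush()                        # stand-in for clwb/sfence on NVMM
        log.write(struct.pack("<QQ", offset, len(payload)))
        log.flush()
        os.fsync(log.fileno())

    on_receive(0, b"PUT key=1 value=42")
    print("payload persisted in place; log holds only (offset, length)")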

Optimizing Energy-Performance Trade-Offs in Solar-Powered Edge Devices

Peter G. Harrison, Imperial College London; Naresh M. Patel, NetApp

Proceedings of the 2018 ACM/SPEC International Conference on Performance Engineering (ICPE 2018)
April 9–13, 2018

Power modes can be used to save energy in electronic devices, but a low power level typically degrades performance. This trade-off is addressed in the so-called EP-queue model, a queue-depth-dependent M/GI/1 queue augmented with power-down and power-up phases of operation. The ability to change service times through power settings allows us to leverage a Markov Decision Process (MDP), an approach we illustrate using a simple, fully solar-powered case study with finite states representing levels of battery charge and solar intensity.
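
The EP-queue model itself is not given in the abstract, so the snippet below is only a generic finite-state MDP solved by value iteration, with made-up states (battery level and solar intensity), actions (low or high power), transition probabilities, and costs that trade energy use against a response-time penalty. None of the numbers come from the paper; the point is just the shape of the decision problem.

    # Value iteration over a tiny, made-up energy/performance MDP; all states,
    # actions, probabilities, and costs are illustrative only.
    import itertools

    BATTERY = range(3)        # 0 = empty, 2 = full
    SUN = range(2)            # 0 = dark, 1 = bright
    STATES = list(itertools.product(BATTERY, SUN))
    ACTIONS = ("low_power", "high_power")

    def cost(state, action):
        battery, _ = state
        energy = 1.0 if action == "low_power" else 3.0
        delay = 4.0 if action == "low_power" else 1.0   # slower service at low power
        risk = 10.0 if battery == 0 and action == "high_power" else 0.0
        return energy + delay + risk

    def transitions(state, action):
        """Return [(probability, next_state)]: the sun flips with probability 0.3;
        the battery drains faster at high power and recharges when it is bright."""
        battery, sun = state
        drain = 1 if action == "low_power" else 2
        result = []
        for next_sun, p in ((sun, 0.7), (1 - sun, 0.3)):
            nb = battery + (1 if next_sun == 1 else -drain)
            result.append((p, (max(0, min(max(BATTERY), nb)), next_sun)))
        return result

    def solve(gamma=0.9, sweeps=200):
        V = {s: 0.0 for s in STATES}
        for _ in range(sweeps):
            V = {s: min(cost(s, a) + gamma * sum(p * V[ns] for p, ns in transitions(s, a))
                        for a in ACTIONS)
                 for s in STATES}
        return {s: min(ACTIONS, key=lambda a: cost(s, a) + gamma *
                       sum(p * V[ns] for p, ns in transitions(s, a)))
                for s in STATES}

    print(solve())   # maps each (battery, sun) state to the cheaper power mode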

Clay Codes: Moulding MDS Codes to Yield an MSR Code

Myna Vajha, Vinayak Ramkumar, Bhagyashree Puranik, Ganesh Kini, Elita Lobo, Birenjith Sasidharan, and P. Vijay Kumar, Indian Institute of Science, Bangalore; Alexander Barg and Min Ye, University of Maryland; Srinivasan Narayanamurthy, Syed Hussain, and Siddhartha Nandi, NetApp ATG, Bangalore

The 16th USENIX Conference on File and Storage Technologies (FAST ’18)
FEBRUARY 12–15, 2018
OAKLAND, CA, USA

As data centers grow in scale, the number of node failures increases sharply. To ensure availability of data, failure-tolerance schemes such as Reed-Solomon (RS) or, more generally, Maximum Distance Separable (MDS) erasure codes are used. However, while MDS codes offer minimum storage overhead for a given amount of failure tolerance, they do not meet other practical needs of today’s data centers. Although modern codes such as Minimum Storage Regenerating (MSR) codes are designed to meet these practical needs, they are available only in highly constrained theoretical constructions that are not mature enough for practical implementation. We present Clay codes that extract the best from both worlds. Clay (short for Coupled-Layer) codes are MSR codes that offer a simplified construction for decoding/repair by using pairwise coupling across multiple stacked layers of any single MDS code. In addition, Clay codes provide the first practical implementation of an MSR code that offers (a) low storage overhead, (b) simultaneous optimality in terms of three key parameters: repair bandwidth, sub-packetization level, and disk I/O, (c) uniform repair performance of data and parity nodes, and (d) support for both single and multiple-node repairs, while permitting faster and more efficient repair.

While all MSR codes are vector codes, distributed storage systems have so far not supported vector codes. We have modified Ceph to support any vector code, and our contribution is now part of Ceph’s master codebase. We have implemented Clay codes and integrated them as a plugin in Ceph. Six example Clay codes were evaluated on a cluster of Amazon EC2 instances, with code parameters carefully chosen to match known erasure-code deployments in practice. A particular example code, with storage overhead of 1.25x, is shown to reduce repair network traffic by a factor of 2.9 in comparison with RS codes; similar reductions are obtained for both repair time and disk read.
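
As a rough sanity check on the reported 2.9x figure, the snippet below evaluates the standard MSR repair-bandwidth expression for a hypothetical (n, k, d) = (20, 16, 19) code, which has the stated 1.25x storage overhead. The exact parameters of the paper's example code may differ, and measured network traffic naturally falls somewhat below the theoretical bound.

    # Single-node repair traffic: Reed-Solomon versus the standard MSR bound,
    # for an assumed (n, k, d) = (20, 16, 19) code (20/16 = 1.25x overhead).
    n, k, d = 20, 16, 19
    alpha = 1.0                              # data stored per node (normalized)

    rs_repair = k * alpha                    # RS reads k full nodes' worth of data
    msr_repair = d * alpha / (d - k + 1)     # each of d helpers sends alpha/(d-k+1)

    print(f"RS repair traffic : {rs_repair:.2f}")
    print(f"MSR repair traffic: {msr_repair:.2f}")
    print(f"theoretical reduction: {rs_repair / msr_repair:.2f}x "
          "(the paper measures about 2.9x)")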

Fail-Slow at Scale: Evidence of Hardware Performance Faults in Large Production Systems

Haryadi S. Gunawi and Riza O. Suminto, University of Chicago; Russell Sears and Casey Golliher, Pure Storage; Swaminathan Sundararaman, Parallel Machines; Xing Lin and Tim Emami, NetApp; Weiguang Sheng and Nematollah Bidokhti, Huawei; Caitie McCaffrey, Twitter; Gary Grider and Parks M. Fields, Los Alamos National Laboratory; Kevin Harms and Robert B. Ross, Argonne National Laboratory; Andree Jacobson, New Mexico Consortium; Robert Ricci and Kirk Webb, University of Utah; Peter Alvaro, University of California, Santa Cruz; H. Birali Runesha, Mingzhe Hao, and Huaicheng Li, University of Chicago

The 16th USENIX Conference on File and Storage Technologies (FAST ’18)
FEBRUARY 12–15, 2018
OAKLAND, CA, USA

Fail-slow hardware is an under-studied failure mode. We present a study of 101 reports of fail-slow hardware incidents, collected from large-scale cluster deployments in 12 institutions. We show that all hardware types, including disk, SSD, CPU, memory, and network components, can exhibit performance faults. We make several important observations: faults can convert from one form to another, cascading root causes and impacts can be long, and fail-slow faults can have varying symptoms. From this study, we make suggestions to vendors, operators, and systems designers.
