Algorithms and Data Structures for Efficient Free Space Reclamation in WAFL

Ram Kesavan, Rohit Singh, and Travis Grusecki, NetApp; Yuvraj Patel, University of Wisconsin—Madison

15th USENIX Conference on File and Storage Technologies (FAST 2017)
Feb. 27 – March 2, 2017, Santa Clara, CA

NetApp® WAFL® is a transactional file system that uses the copy-on-write mechanism to support fast write performance and efficient snapshot creation. However, copy-on-write increases the demand on the file system to find free blocks quickly; failure to do so may impede allocations for incoming writes. Efficiency is also important, because the task may consume CPU and other resources. In this paper, we describe the evolution (over more than a decade) of WAFL’s algorithms and data structures for reclaiming space with minimal impact on the overall storage appliance performance.


PASTE: Network Stacks must Integrate with NVMM Abstractions

Michio Honda, Lars Eggert, Douglas Santry. NetApp, Inc.

Fifteenth ACM Workshop on Hot Topics in Networks (HotNets 2016)
November 9-10, 2016, Atlanta, Georgia, USA

This paper argues that the lack of explicit support for nonvolatile main memory (NVMM) in network stacks fundamentally limits application performance. NVMM devices have been integrated into general-purpose OSes by providing familiar file-based interfaces and efficient byte-granularity access by bypassing page caches. However, this powerful property cannot be fully utilized unless network stacks also support it and applications exploit such support. This requires a thoroughly new network stack design, including low-level buffer management and APIs. We propose such a new network stack architecture to support NVMM and demonstrate its advantages for efficient write-ahead logging, a popular technique to implement transactions.
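For readers unfamiliar with write-ahead logging, the following is a minimal sketch of the general technique on an ordinary file, not of the NVMM-aware network stack the paper proposes. The `TinyWAL` class, the JSON record format, and the `demo` helper are hypothetical names chosen for illustration.

```python
import json
import os
import tempfile


class TinyWAL:
    """Toy key-value store that logs each update durably before applying it."""

    def __init__(self, log_path):
        self.log_path = log_path
        self.state = {}
        self._replay()
        self._log = open(log_path, "a", encoding="utf-8")

    def _replay(self):
        # Rebuild the in-memory state from records that reached the log.
        if not os.path.exists(self.log_path):
            return
        with open(self.log_path, "r", encoding="utf-8") as f:
            for line in f:
                record = json.loads(line)
                self.state[record["key"]] = record["value"]

    def put(self, key, value):
        # 1. Append the update to the log and force it to stable storage.
        self._log.write(json.dumps({"key": key, "value": value}) + "\n")
        self._log.flush()
        os.fsync(self._log.fileno())
        # 2. Only then apply the update to the in-memory state.
        self.state[key] = value

    def close(self):
        self._log.close()


def demo():
    path = os.path.join(tempfile.mkdtemp(), "wal.log")
    store = TinyWAL(path)
    store.put("a", 1)
    store.put("a", 2)
    store.close()
    # Simulate a restart: a fresh instance recovers its state from the log alone.
    recovered = TinyWAL(path)
    state = dict(recovered.state)
    recovered.close()
    return state
```

The `flush` plus `fsync` on every update is the cost that makes logging latency-sensitive. The sketch does not handle a torn final record, which a real log would need to detect.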


Early Detection of Configuration Errors to Reduce Failure Damage

Tianyin Xu, Xinxin Jin, Peng Huang, and Yuanyuan Zhou, University of California, San Diego; Shan Lu, University of Chicago; Long Jin, University of California, San Diego; Shankar Pasupathy, NetApp, Inc.

Awarded Best Paper
12th USENIX Symposium on Operating Systems Design and Implementation

Early detection is the key to minimizing failure damage induced by configuration errors, especially those errors in configurations that control failure handling and fault tolerance. Since such configurations are not needed for initialization, many systems do not check their settings early (e.g., at startup time). Consequently, the errors become latent until their manifestations cause severe damage, such as breaking the failure handling. Such latent errors are likely to escape from sysadmins’ observation and testing, and be deployed to production at scale.

Our study shows that many of today’s mature, widely-used software systems are subject to latent configuration errors (referred to as LC errors) in their critically important configurations—those related to the system’s reliability, availability, and serviceability. One root cause is that many (14.0%–93.2%) of these configurations do not have any special code for checking the correctness of their settings at the system’s initialization time.

To help software systems detect LC errors early, we present a tool named PCHECK that analyzes the source code and automatically generates configuration checking code (called checkers). The checkers emulate the late execution that uses configuration values, and detect LC errors if the error manifestations are captured during the emulated execution. Our results show that PCHECK can help systems detect 75+% of real-world LC errors at the initialization phase, including 37 new LC errors that have not been exposed before. Compared with existing detection tools, it can detect 31% more LC errors.
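The idea of emulating a late use of a configuration value at startup can be illustrated by hand. The sketch below is not PCHECK, which generates its checkers automatically from source code; it is a hand-written example for one hypothetical setting, `backup_dir`, that would otherwise be used only when a failure triggers a backup.

```python
import tempfile


def check_backup_dir(config):
    """Startup check that emulates the later use of a failure-handling setting.

    `backup_dir` is a hypothetical setting name. Returns a list of error
    strings; an empty list means the setting looks usable.
    """
    path = config.get("backup_dir")
    if not path:
        return ["backup_dir is not set"]
    try:
        # Emulate the later use: create (and automatically delete) a scratch file.
        with tempfile.TemporaryFile(dir=path):
            pass
    except OSError as exc:
        return [f"backup_dir {path!r} is unusable: {exc}"]
    return []
```

Calling `check_backup_dir` during initialization surfaces a bad path immediately instead of at the moment the backup is first needed.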


To Waffinity and Beyond: A Scalable Architecture for Incremental Parallelization of File System Code

Matthew Curtis-Maury, Vinay Devadas, Vania Fang, and Aditya Kulkarni, NetApp, Inc.

12th USENIX Symposium on Operating Systems Design and Implementation

To achieve higher I/O throughput and better overall system performance, commercial storage systems must fully exploit the increasing core counts on modern systems. At the same time, legacy systems with millions of lines of code cannot simply be rewritten for improved scalability. In this paper, we describe the evolution of the multiprocessor software architecture (MP model) employed by the NetApp® Data ONTAP® WAFL® file system as a case study in incrementally scaling a production storage system.

The initial model is based on small-scale data partitioning, whereby user-file reads and writes to disjoint file regions are parallelized. This model is then extended with hierarchical data partitioning to manage concurrent accesses to important file system objects, thus benefiting additional workloads. Finally, we discuss a fine-grained lock-based MP model within the existing data-partitioned architecture to support workloads where data accesses do not map neatly to the predefined partitions. In these data partitioning and lock-based MP models, we have facilitated incremental advances in parallelism without a large-scale code rewrite, a major advantage in the multi-million line WAFL codebase. Our results show that we are able to increase CPU utilization by as much as 104% on a 20-core system, resulting in throughput gains of up to 130%. These results demonstrate the success of the proposed MP models in delivering scalable performance while balancing time-to-market requirements. The models presented can also inform scalable system redesign in other domains.
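As a rough illustration of data partitioning in general, the sketch below maps file regions to partitions so that operations in different partitions can proceed in parallel while operations in the same partition are serialized. It is not the Waffinity design; the region size, partition count, and mapping function are assumptions made for the example.

```python
REGION_SIZE = 1 << 20   # hypothetical: 1 MiB file regions
NUM_PARTITIONS = 8      # hypothetical partition count


def partition_for(file_id, offset):
    """Map a byte offset within a file to a partition index."""
    region = offset // REGION_SIZE
    return (file_id + region) % NUM_PARTITIONS


def can_run_in_parallel(op_a, op_b):
    """Two (file_id, offset) operations may run concurrently if their partitions differ."""
    return partition_for(*op_a) != partition_for(*op_b)
```

With this mapping, `can_run_in_parallel((3, 0), (3, 1 << 20))` is true because the two offsets fall in adjacent regions. Two distinct regions can still collide on one partition (for example, regions 0 and 8 of the same file), in which case they are simply serialized.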


CoARC: Co-operative, Aggressive Recovery and Caching for Failures in Erasure Coded Hadoop

Pradeep Subedi, Ping Huang, Tong Liu, Virginia Commonwealth University, Joseph Moore, Stan Skelton, NetApp, Inc., Xubin He, Virginia Commonwealth University.

2016 International Conference on Parallel Processing (ICPP 2016)
Philadelphia, PA, USA

Cloud file systems like Hadoop have become the norm for handling big data because of their easy scaling and distributed storage layout. However, these systems are susceptible to failures, and data needs to be recovered when a failure is detected. During temporary failures, MapReduce jobs or file system clients perform degraded reads to satisfy the read request. We argue that the lack of sharing of recovered data during degraded reads, together with the recovery of only the requested data block, places a heavy strain on the system’s network resources and increases job execution time. To this end, we propose CoARC (Co-operative, Aggressive Recovery and Caching), a new data-recovery mechanism for unavailable data during degraded reads in distributed file systems. The main idea is to recover not only the data block that was requested but also other temporarily unavailable blocks in the same strip and cache them in a separate data node. We also propose an LRF (Least Recently Failed) cache replacement algorithm for such recovery caches. We show that CoARC significantly reduces network usage and job runtime in erasure coded Hadoop.
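The abstract does not spell out how LRF works, so the following is only one plausible reading of the name: order cached blocks by the time their failure was observed and evict the one whose failure is oldest. The `LRFCache` class and its methods are hypothetical, and the paper's actual policy may differ.

```python
from collections import OrderedDict


class LRFCache:
    """Cache of recovered blocks that evicts the block whose failure is oldest."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._blocks = OrderedDict()  # block_id -> data, oldest failure first

    def record_failure(self, block_id, data):
        # A newly observed failure makes this block the most recently failed.
        self._blocks[block_id] = data
        self._blocks.move_to_end(block_id)
        if len(self._blocks) > self.capacity:
            self._blocks.popitem(last=False)  # evict the least recently failed

    def get(self, block_id):
        # Reads do not reorder entries; only failures do (unlike LRU).
        return self._blocks.get(block_id)
```

Under this reading, the difference from LRU is that a read hit does not protect an entry from eviction; only a fresh failure does.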


Think Global, Act Local: A Buffer Cache Design for Global Ordering and Parallel Processing in the WAFL File System

Peter Denz, Matthew Curtis-Maury, Vinay Devadas. NetApp, Inc.

2016 International Conference on Parallel Processing (ICPP 2016)
Philadelphia, PA, USA

Given the enormous disparity in access speeds between main memory and storage media, modern storage servers must leverage highly effective buffer cache policies to meet demanding performance requirements. At the same time, these page replacement policies need to scale efficiently with ever-increasing core counts and memory sizes, which necessitate parallel buffer cache management. However, these requirements of effectiveness and scalability are at odds, because centralized processing does not scale with more processors and parallel policies are a challenge to implement with maximum effectiveness. We have overcome this difficulty in the NetApp® Data ONTAP® WAFL® file system by using a sophisticated technique to simultaneously allow global buffer prioritization while providing parallel management operations. In addition, we have extended the buffer cache to provide a soft isolation of different workloads’ buffer cache usage, which is akin to buffer cache quality of service (QoS). This paper presents the design and implementation of these significant extensions in the buffer cache of a high-performance commercial file system.


StackMap: Low-Latency Networking with the OS Stack and Dedicated NICs

Kenichi Yasukata, Michio Honda, Douglas Santry, and Lars Eggert

2016 USENIX Annual Technical Conference
Denver, CO

StackMap brings the best aspects of kernel-bypass networking into a new low-latency OS network service based on the full-featured kernel TCP implementation. It dedicates network interfaces to applications and offers an extended version of the netmap API as a zero-copy, low-overhead data path, alongside a control path based on the socket API. For small-message, transactional workloads, StackMap outperforms baseline Linux by 4 to 78% in latency and 42 to 133% in throughput. It also achieves performance comparable to Seastar, a highly optimized user-level TCP/IP stack that runs on top of DPDK.


The Tail at Store: A Revelation from Millions of Hours of Disk and SSD Deployments

Mingzhe Hao, Gokul Soundararajan, Deepak Kenchammana-Hosekote, Andrew A. Chien, and Haryadi S. Gunawi

14th USENIX Conference on File and Storage Technologies (FAST ’16)
Santa Clara, CA

We study storage performance in over 450,000 disks and 4,000 SSDs over 87 days for an overall total of 857 million (disk) and 7 million (SSD) drive hours. We find that storage performance instability is not uncommon: 0.2% of the time, a disk is more than 2x slower than its peer drives in the same RAID group (and 0.6% for SSD). As a consequence, disk and SSD-based RAIDs experience at least one slow drive (i.e., storage tail) 1.5% and 2.2% of the time, respectively. To understand the root causes, we correlate slowdowns with other metrics (workload I/O rate and size, drive event, age, and model). Overall, we find that the primary causes of slowdowns are the internal characteristics and idiosyncrasies of modern disk and SSD drives. We observe that storage tails can adversely impact RAID performance, motivating the design of tail-tolerant RAID. To the best of our knowledge, this work is the most extensive documentation of storage performance instability in the field.
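The "more than 2x slower than its peer drives" criterion can be sketched as a simple comparison within one RAID group. The abstract does not say how the peer baseline is computed, so the sketch assumes the median latency of the other drives in the group; the function name and the sample latencies are hypothetical.

```python
from statistics import median


def slow_drives(latencies_ms, threshold=2.0):
    """Return drives whose latency exceeds `threshold` times the median of their peers.

    `latencies_ms` maps a drive name to its latency for one RAID group and
    one time window. Using the peer median as the baseline is an assumption.
    """
    slow = []
    for drive, latency in latencies_ms.items():
        peers = [v for d, v in latencies_ms.items() if d != drive]
        if peers and latency > threshold * median(peers):
            slow.append(drive)
    return slow


group = {"d0": 5.0, "d1": 5.5, "d2": 4.8, "d3": 12.0}
print(slow_drives(group))  # → ['d3']
```

Here `d3` is flagged because 12.0 ms exceeds twice the 5.0 ms median of its three peers; a median baseline keeps one slow drive from skewing the comparison for the others.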


The Private Lives of Disk Drives

Rajesh Sundaram

NetApp builds resiliency into its storage systems at every level to ensure that critical data is always protected, including technologies such as SnapMirror®, SnapVault®, and SnapRestore® that protect you from events ranging from sitewide disasters to user and application errors. NetApp also offers a unique degree of resiliency against problems that occur within disk drives themselves. This paper describes five of the most troublesome disk problems and the resiliency technologies that NetApp Engineering has developed to protect against them.