Ram Kesavan, Google; Jason Hennessey, Richard Jernigan, Peter Macko, Keith A. Smith, Daniel Tennant, and Bharadwaj V. R., NetApp
2019 USENIX Annual Technical Conference
The rapid growth of customer applications and datasets has led to demand for storage that can scale with the needs of modern workloads. We have developed FlexGroup volumes to meet this need. FlexGroups combine local WAFLÂ® file systems in a distributed storage cluster to provide a single namespace that seamlessly scales across the aggregate resources of the cluster (CPU, storage, etc.) while preserving the features and robustness of the WAFL file system.
In this paper we present the FlexGroup design, which includes a new remote access layer that supports distributed transactions and the novel heuristics used to balance load and capacity across a storage cluster. We evaluate FlexGroup performance and efficacy through lab tests and field data from over 1,000 customer FlexGroups.
Ram Kesavan, Matthew Curtis-Maury, Vinay Devadas, and Kesari Mishra, NetApp
7th USENIX Conference on File and Storage Technologies (FAST)
FEBRUARY 25–28, 2019
BOSTON, MA, USA
As a file system ages, it can experience multiple forms of fragmentation. Fragmentation of the free space in the file system can lower write performance and subsequent read performance. Client operations as well as internal operations, such as deduplication, can fragment the layout of an individual file, which also impacts file read performance. File systems that allow sub-block granular addressing can gather intra-block fragmentation, which leads to wasted free space. This paper describes how the NetApp® WAFL® file system leverages a storage virtualization layer for defragmentation techniques that physically relocate blocks efficiently, including those in read-only snapshots. The paper analyzes the effectiveness of these techniques at reducing fragmentation and improving overall performance across various storage media.
Ram Kesavan, Matthew Curtis-Maury, and Mrinal Bhattacharjee (NetApp)
47th International Conference on Parallel Processing (ICPP 2018)
August 13-16, 2018 Eugene, Oregon, USA
The WAFL write allocator is responsible for assigning blocks on persistent storage to data in a way that maximizes both write throughput to the storage media and subsequent read performance of data. The ability to quickly and efficiently guide the write allocator toward desirable regions of available free space is critical to achieving that goal. This ability is influenced by several factors, such as any underlying RAID geometry, media-specific attributes such as erase-block size of solid state drives or zone size of shingled magnetic hard drives, and free space fragmentation. This paper presents and evaluates the techniques used by the WAFL write allocator to efficiently find regions of free space.
Deng Zhou, San Diego State University; Vania Fang, NetApp; Tao Xie, Wen Pan, San Diego State University; Ram Kesavan, Tony Lin, and Naresh Patel, NetApp
ACM Transactions on Storage (TOS) Volume 14 Issue 2, May 201 Article No. 14
Since little has been reported in the literature concerning enterprise storage system file-level request scheduling, we do not have enough knowledge about how various scheduling factors affect performance. Moreover, we are in lack of a good understanding on how to enhance request scheduling to adapt to the changing characteristics of workloads and hardware resources. To answer these questions, we first build a request scheduler prototype based on WAFL®, a mainstream file system running on numerous enterprise storage systems worldwide. Next, we use the prototype to quantitatively measure the impact of various scheduling configurations on performance on a NetApp®’s enterprise-class storage system. Several observations have been made. For example, we discover that in order to improve performance, the priority of write requests and non-preempted restarted requests should be boosted in some workloads. Inspired by these observations, we further propose two scheduling enhancement heuristics called SORD (size-oriented request dispatching) and QATS (queue-depth aware time slicing). Finally, we evaluate them by conducting a wide range of experiments using workloads generated by SPC-1 and SFS2014 on both HDD-based and all-flash platforms. Experimental results show that the combination of the two can noticeably reduce average request latency under some workloads.
Ram Kesavan, NetApp, Inc.; Harendra Kumar, Composewell Technologies; Sushrut Bhowmik, NetApp, Inc.
The 16th USENIX Conference on File and Storage Technologies
FEBRUARY 12–15, 2018
OAKLAND, CA, USA
Consistent and timely access to an arbitrarily damaged file system is an important requirement of enterprise class systems. Repairing file system inconsistencies is accomplished most simply when file system access is limited to the repair tool. Checking and repairing a file system while it is open for general access present unique challenges. In this paper, we explore these challenges, present our online repair tool for the NetApp® WAFL® file system, and show how it achieves the same results as offline repair even while client access is enabled. We present some implementation details and evaluate its performance. To the best of our knowledge, this publication is the first to describe a fully functional online repair tool.
Ram Kesavan, Rohit Singh, Travis Grusecki, NetApp Inc. Yuvraj Patel, University of Wisconsin-Madison
ACM Transactions on Storage (TOS)
Volume 13 Issue 3, September 2017
Article No. 23
NetApp®WAFL® is a transactional file system that uses the copy-on-write mechanism to support fast write performance and efficient snapshot creation. However, copy-on-write increases the demand on the file system to find free blocks quickly, which makes rapid free space reclamation essential. Inability to find free blocks quickly may impede allocations for incoming writes. Efficiency is also important, because the task of reclaiming free space may consume CPU and other resources at the expense of client operations. In this article, we describe the evolution (over more than a decade) of the WAFL algorithms and data structures for reclaiming space with minimal impact to the overall performance of the storage appliance.
Matthew Curtis-Maury, Ram Kesavan, and Mrinal K. Bhattacharjee. NetApp, Inc
The 46th International Conference on Parallel Processing (ICPP-2017)
August 14-17, 2017 Bristol, UK
Enterprise storage systems must scale to increasing core counts to meet stringent performance requirements. Both the NetApp® Data ONTAP® storage operating system and its WAFL® file system have been incrementally parallelized over the years, but some components remain single-threaded. The WAFL write allocator, which is responsible for assigning blocks on persistent storage to dirty data in a way that maximizes write throughput to the storage media, is single-threaded and has become a major scalability bottleneck. This paper presents a new write allocation architecture, White Alligator, for the WAFL file system that scales performance on many cores. We also place the new architecture in the context of the historical parallelization of WAFL and discuss the architectural decisions that have facilitated this parallelism. The resulting system demonstrates increased scalability that results in throughput gains of up to 274% on a many-core storage system.
The paper attached below is the authors’ version of the work. It is posted here for your personal use. Not for redistribution.
The definitive version was published in Proceedings of the 46th International Conference on Parallel Processing (ICPP-2017) Bristol, UK, August 14-17, 2017.
Harendra Kumar; Yuvraj Patel, University of Wisconsin—Madison; Ram Kesavan and Sumith Makam, NetApp
15th USENIX Conference on File and Storage Technologies (FAST 2017)
Feb. 27 – March 2, 2017 Santa Clara, CA
We introduce a low-cost incremental checksum technique that protects metadata blocks against in-memory scribbles, and a lightweight digest-based transaction auditing mechanism that enforces file system consistency invariants. Compared with previous work, our techniques reduce performance overhead by an order of magnitude. They also help distinguish scribbles from logic bugs. We also present a mechanism to pinpoint the cause of scribbles on production systems. Our techniques have been productized in the NetApp® WAFL® (Write Anywhere File Layout) file system with negligible performance overhead, greatly reducing corruption-related incidents over the past five years, based on millions of runtime hours.
Ram Kesavan, Rohit Singh, and Travis Grusecki, NetApp; Yuvraj Patel, University of Wisconsin—Madison
15th USENIX Conference on File and Storage Technologies (FAST 2017)
Feb. 27 – March 2, 2017 Santa Clara, CA
NetApp®WAFL®is a transactional file system that uses the copy-on-write mechanism to support fast write performance and efficient snapshot creation. However, copy-on-write increases the demand on the file system to find free blocks quickly; failure to do so may impede allocations for incoming writes. Efficiency is also important, because the task may consume CPU and other resources. In this paper, we describe the evolution (over more than a decade) of WAFL’s algorithms and data structures for reclaiming space with minimal impact on the overall storage appliance performance.
Matthew Curtis-Maury, Vinay Devadas, Vania Fang, and Aditya Kulkarni, NetApp, Inc.
12th USENIX Symposium on Operating Systems Design and Implementation
In order to achieve higher I/O throughput and better overall system performance, it is necessary for commercial storage systems to fully exploit the increasing core counts on modern systems. At the same time, legacy systems with millions of lines of code cannot simply be rewritten for improved scalability. In this paper, we describe the evolution of the multiprocessor software architecture (MP model) employed by the Netapp® Data ONTAP® WAFL® file system as a case study in incrementally scaling a production storage system.
The initial model is based on small-scale data partitioning, whereby user-file reads and writes to disjoint file regions are parallelized. This model is then extended with hierarchical data partitioning to manage concurrent accesses to important file system objects, thus benefiting additional workloads. Finally, we discuss a fine-grained lock-based MP model within the existing data-partitioned architecture to support workloads where data accesses do not map neatly to the predefined partitions. In these data partitioning and lock-based MP models, we have facilitated incremental advances in parallelism without a large-scale code rewrite, a major advantage in the multi-million line WAFL codebase. Our results show that we are able to increase CPU utilization by as much as 104% on a 20-core system, resulting in throughput gains of up to 130%. These results demonstrate the success of the proposed MP models in delivering scalable performance while balancing time-to-market requirements. The models presented can also inform scalable system redesign in other domains.