Tag Archives: usenix

FlexGroup Volumes: A Distributed WAFL File System

Ram Kesavan, Google; Jason Hennessey, Richard Jernigan, Peter Macko, Keith A. Smith, Daniel Tennant, and Bharadwaj V. R., NetApp

2019 USENIX Annual Technical Conference

The rapid growth of customer applications and datasets has led to demand for storage that can scale with the needs of modern workloads. We have developed FlexGroup volumes to meet this need. FlexGroups combine local WAFL® file systems in a distributed storage cluster to provide a single namespace that seamlessly scales across the aggregate resources of the cluster (CPU, storage, etc.) while preserving the features and robustness of the WAFL file system.

In this paper we present the FlexGroup design, which includes a new remote access layer that supports distributed transactions and the novel heuristics used to balance load and capacity across a storage cluster. We evaluate FlexGroup performance and efficacy through lab tests and field data from over 1,000 customer FlexGroups.


Early Detection of Configuration Errors to Reduce Failure Damage

Tianyin Xu, Xinxin Jin, Peng Huang, and Yuanyuan Zhou, University of California, San Diego; Shan Lu, University of Chicago; Long Jin, University of California, San Diego; Shankar Pasupathy, NetApp, Inc.

Awarded Best Paper
12th USENIX Symposium on Operating Systems Design and Implementation

Early detection is the key to minimizing failure damage induced by configuration errors, especially those errors in configurations that control failure handling and fault tolerance. Since such configurations are not needed for initialization, many systems do not check their settings early (e.g., at startup time). Consequently, the errors become latent until their manifestations cause severe damage, such as breaking the failure handling. Such latent errors are likely to escape from sysadmins’ observation and testing, and be deployed to production at scale.

Our study shows that many of today’s mature, widely-used software systems are subject to latent configuration errors (referred to as LC errors) in their critically important configurations—those related to the system’s reliability, availability, and serviceability. One root cause is that many (14.0%–93.2%) of these configurations do not have any special code for checking the correctness of their settings at the system’s initialization time.

To help software systems detect LC errors early, we present a tool named PCHECK that analyzes the source code and automatically generates configuration checking code (called checkers). The checkers emulate the late execution that uses configuration values, and detect LC errors if the error manifestations are captured during the emulated execution. Our results show that PCHECK can help systems detect 75+% of real-world LC errors at the initialization phase, including 37 new LC errors that have not been exposed before. Compared with existing detection tools, it can detect 31% more LC errors.
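The abstract does not show what a generated checker looks like, so the following is only a minimal sketch of the general idea: at startup, emulate the operation that would later use a failure-handling setting, and report problems before they can cause damage. The setting name `backup_dir` and the function `check_backup_dir` are hypothetical and are not taken from PCHECK.

```python
import os
import tempfile


def check_backup_dir(config):
    """Emulate, at startup, the late use of a failure-handling setting.

    `backup_dir` is a hypothetical setting that would otherwise be read
    only when a failure occurs. Returns a list of error strings, which is
    empty when no problem is found.
    """
    errors = []
    path = config.get("backup_dir")
    if not path:
        errors.append("backup_dir is not set")
    elif not os.path.isdir(path):
        errors.append(f"backup_dir {path!r} is not an existing directory")
    elif not os.access(path, os.W_OK):
        errors.append(f"backup_dir {path!r} is not writable")
    return errors


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as good_dir:
        print(check_backup_dir({"backup_dir": good_dir}))
    print(check_backup_dir({"backup_dir": "/no/such/dir/for/this/sketch"}))
```

The check has no side effects: it inspects the path rather than writing to it, so it is safe to run during initialization.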


To Waffinity and Beyond: A Scalable Architecture for Incremental Parallelization of File System Code

Matthew Curtis-Maury, Vinay Devadas, Vania Fang, and Aditya Kulkarni, NetApp, Inc.

12th USENIX Symposium on Operating Systems Design and Implementation

To achieve higher I/O throughput and better overall system performance, commercial storage systems must fully exploit the increasing core counts on modern systems. At the same time, legacy systems with millions of lines of code cannot simply be rewritten for improved scalability. In this paper, we describe the evolution of the multiprocessor software architecture (MP model) employed by the NetApp® Data ONTAP® WAFL® file system as a case study in incrementally scaling a production storage system.

The initial model is based on small-scale data partitioning, whereby user-file reads and writes to disjoint file regions are parallelized. This model is then extended with hierarchical data partitioning to manage concurrent accesses to important file system objects, thus benefiting additional workloads. Finally, we discuss a fine-grained lock-based MP model within the existing data-partitioned architecture to support workloads where data accesses do not map neatly to the predefined partitions. In these data partitioning and lock-based MP models, we have facilitated incremental advances in parallelism without a large-scale code rewrite, a major advantage in the multi-million line WAFL codebase. Our results show that we are able to increase CPU utilization by as much as 104% on a 20-core system, resulting in throughput gains of up to 130%. These results demonstrate the success of the proposed MP models in delivering scalable performance while balancing time-to-market requirements. The models presented can also inform scalable system redesign in other domains.
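The abstract does not say how file regions are mapped to partitions, so the sketch below is only an illustration of data partitioning in general. Requests to the same file region always land in the same partition and are serialized there, while requests in different partitions touch disjoint regions and could run in parallel. `REGION_SIZE`, `NUM_PARTITIONS`, and both functions are hypothetical.

```python
REGION_SIZE = 1 << 20      # hypothetical: 1 MiB file regions
NUM_PARTITIONS = 8         # hypothetical number of data partitions


def partition_for(file_id, offset):
    """Map a file region to a partition index.

    Every access to a given region maps to the same partition, so work in
    different partitions never touches the same region.
    """
    region = offset // REGION_SIZE
    return (file_id * 31 + region) % NUM_PARTITIONS


def group_by_partition(requests):
    """Group (file_id, offset) requests by the partition that owns them."""
    groups = {}
    for file_id, offset in requests:
        key = partition_for(file_id, offset)
        groups.setdefault(key, []).append((file_id, offset))
    return groups
```

Each group could then be handed to its own worker, with no locking needed between groups.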


StackMap: Low-Latency Networking with the OS Stack and Dedicated NICs

Kenichi Yasukata, Michio Honda, Douglas Santry, and Lars Eggert

2016 USENIX Annual Technical Conference
Denver, CO

StackMap brings the best aspects of kernel-bypass networking to a new low-latency OS network service based on the full-featured kernel TCP implementation. It does so by dedicating network interfaces to applications and offering an extended version of the netmap API for a zero-copy, low-overhead data path, alongside a control path based on the socket API. For small-message, transactional workloads, StackMap outperforms baseline Linux by 4 to 78% in latency and 42 to 133% in throughput. It also achieves performance comparable to Seastar, a highly optimized user-level TCP/IP stack that runs on top of DPDK.


The Tail at Store: A Revelation from Millions of Hours of Disk and SSD Deployments

Mingzhe Hao, Gokul Soundararajan, Deepak Kenchammana-Hosekote, Andrew A. Chien, and Haryadi S. Gunawi

14th USENIX Conference on File and Storage Technologies (FAST ’16)
Santa Clara, CA

We study storage performance in over 450,000 disks and 4,000 SSDs over 87 days for an overall total of 857 million (disk) and 7 million (SSD) drive hours. We find that storage performance instability is not uncommon: 0.2% of the time, a disk is more than 2x slower than its peer drives in the same RAID group (and 0.6% for SSD). As a consequence, disk and SSD-based RAIDs experience at least one slow drive (i.e., storage tail) 1.5% and 2.2% of the time, respectively. To understand the root causes, we correlate slowdowns with other metrics (workload I/O rate and size, drive event, age, and model). Overall, we find that the primary causes of slowdowns are the internal characteristics and idiosyncrasies of modern disk and SSD drives. We observe that storage tails can adversely impact RAID performance, motivating the design of tail-tolerant RAID. To the best of our knowledge, this work is the most extensive documentation of storage performance instability in the field.
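The "more than 2x slower than its peer drives" criterion can be illustrated with a small sketch. Comparing each drive against the median latency of the other drives in its RAID group is an assumption made here, since the abstract does not spell out the exact baseline. The function name `slow_drives` and the sample numbers are hypothetical.

```python
from statistics import median


def slow_drives(latencies, threshold=2.0):
    """Return drives whose latency exceeds `threshold` times the peer median.

    `latencies` maps drive name -> observed latency (any consistent unit)
    for the drives of one RAID group. Using the median of the *other*
    drives as the baseline is an assumption of this sketch.
    """
    slow = {}
    for drive, latency in latencies.items():
        peers = [v for d, v in latencies.items() if d != drive]
        if not peers:
            continue  # a single-drive group has no peers to compare with
        baseline = median(peers)
        if baseline > 0 and latency > threshold * baseline:
            slow[drive] = latency / baseline  # slowdown ratio
    return slow
```

For a group with latencies `{"d0": 5.0, "d1": 5.0, "d2": 6.0, "d3": 15.0}`, only `d3` is flagged, with a slowdown ratio of 3.0 against a peer median of 5.0.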


Lamassu: Storage-Efficient Data-Source Encryption

Peter Shah and Won So

Many storage customers are adopting encryption solutions to protect themselves against data leakage or theft. Encryption solutions are already on the market, many of which sit in, or near, the application that is the source of critical data. We refer to this deployment strategy as data-source encryption. Placing encryption near the source makes it easy to guarantee that data remains encrypted downstream of the application, enabling the use of untrusted storage, such as public clouds. Unfortunately, data-source encryption also prevents downstream storage systems from applying content-based data management features, such as data deduplication, to the data. In this paper, we present Lamassu, an alternative encryption solution that provides strong data-source encryption while preserving downstream storage-based data deduplication. Lamassu uses a convergent encryption strategy to provide this service and, unlike past convergent encryption systems, securely inserts encryption metadata into the data stream rather than placing it in a dedicated store. This allows us to use existing systems without requiring any modification to either the client application or the storage controller. In this paper we will lay out the architecture and security model used in our prototype system, and provide an analysis of its performance under a variety of circumstances. Our performance analysis will show that our system provides excellent storage efficiency, while achieving I/O throughput on par with similar conventional encryption systems.
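To show why convergent encryption keeps deduplication possible, here is a toy sketch of the general technique, not of Lamassu's design, which the abstract does not detail. The key is derived from the plaintext itself, so identical plaintext blocks always produce identical ciphertext. The keystream construction (SHA-256 in counter mode) is a stand-in chosen so that the example runs with the standard library alone; it is not a secure cipher, and all function names are hypothetical.

```python
import hashlib


def _keystream(key, length):
    """Toy keystream: SHA-256 in counter mode. NOT secure; illustration only."""
    out = bytearray()
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(out[:length])


def convergent_encrypt(plaintext):
    """Derive the key from the plaintext, then encrypt deterministically."""
    key = hashlib.sha256(plaintext).digest()
    stream = _keystream(key, len(plaintext))
    ciphertext = bytes(p ^ s for p, s in zip(plaintext, stream))
    return key, ciphertext


def convergent_decrypt(key, ciphertext):
    """Reverse the XOR with the same key-derived keystream."""
    stream = _keystream(key, len(ciphertext))
    return bytes(c ^ s for c, s in zip(ciphertext, stream))
```

Because two clients that encrypt the same block obtain the same ciphertext, a downstream storage system can deduplicate the encrypted blocks without ever seeing the plaintext.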


Warming Up Storage-Level Caches with Bonfire

Y. Zhang, G. Soundararajan, M. W. Storer, L. N. Bairavasundaram, S. Subbiah, A. C. Arpaci-Dusseau, and R. H. Arpaci-Dusseau

Bonfire is a mechanism for accelerating cache warmup for large caches so that application service levels can be met significantly sooner than would be possible with on-demand warmup.

Large caches in storage servers have become essential for meeting service levels required by applications. Today, these caches often need to be warmed with data because of scenarios such as the dynamic creation of cache space and server restarts that clear cache contents. When large storage caches are warmed at the rate of application I/O, warmup can take hours or even days, thus affecting both application performance and server load over a long period of time.

We have created Bonfire, a mechanism for accelerating cache warmup. Bonfire monitors storage server workloads, logs important warmup data, and efficiently preloads storage-level caches with warmup data. Bonfire is based on our detailed analysis of block-level data-center traces that provides insights into heuristics for warmup as well as the potential for efficient mechanisms. We show through both simulation and trace replay that Bonfire reduces both warmup time and backend server load significantly, compared to a cache that is warmed up on demand.
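The monitor, log, and preload steps can be sketched as follows. The abstract does not state which warmup heuristics Bonfire uses, so tracking the most recently accessed blocks is purely an assumption here, and `WarmupLog`, `preload`, and `read_block` are hypothetical names.

```python
from collections import OrderedDict


class WarmupLog:
    """Track the most recently accessed block addresses, up to `capacity`."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._blocks = OrderedDict()

    def record(self, block):
        # Move the block to the most-recent end; evict the oldest if full.
        self._blocks.pop(block, None)
        self._blocks[block] = True
        if len(self._blocks) > self.capacity:
            self._blocks.popitem(last=False)

    def warmup_set(self):
        """Return logged block addresses, oldest first."""
        return list(self._blocks)


def preload(cache, log, read_block):
    """Fill `cache` (a dict) with logged blocks before serving application I/O."""
    for block in log.warmup_set():
        cache[block] = read_block(block)
```

Preloading in bulk from the log avoids waiting for application I/O to miss on each block, which is the on-demand behavior the abstract compares against.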

In Proceedings of the 11th USENIX Conference on File and Storage Technologies (FAST’13)

San Jose, February, 2013.



Storage and Network Deduplication Technologies

Michael Condict

This tutorial provided a detailed look at the multitude of ways that deduplication can be used to improve the efficiency of storage and networking devices.

A Tutorial Presented at the USENIX Conference on File and Storage Technologies 2010 (FAST ’10)


Economic and environmental concerns are currently motivating a push across the computing industry to do more with less: less energy and less money. Deduplication of data is one of the most effective tools to accomplish this. Removing redundant copies of stored data reduces hardware requirements, lowering capital expenses and using less power. Avoiding sending the same data repeatedly across a network increases the effective bandwidth of the link, reducing networking expenses.

This tutorial provided a detailed look at the multitude of ways deduplication can be used to improve the efficiency of storage and networking devices. It consisted of two parts.

The first part introduced the basic concepts of deduplication and compared it to the related technique of file compression. A taxonomy of basic deduplication techniques was covered, including the unit of deduplication (file, block, or variable-length segment), the deduplication scope (file system, storage system, or cluster), in-line vs. background deduplication, trusted fingerprints, and several other design choices. The relative merits of each were analyzed.

The second part discussed advanced techniques, such as the use of fingerprints other than a content hash to uniquely identify data, techniques for deduplicating across a storage cluster, and the use of deduplication within a client-side cache.
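As a minimal sketch of one point in that taxonomy, the example below deduplicates at the fixed-size block level and uses a content hash (SHA-256) as the fingerprint. The function names, the 4096-byte block size, and the in-memory store are illustrative assumptions, not material from the tutorial.

```python
import hashlib


def dedup_blocks(data, block_size=4096):
    """Split `data` into fixed-size blocks and store each unique block once.

    Returns (store, recipe): `store` maps fingerprint -> block bytes, and
    `recipe` is the ordered list of fingerprints needed to rebuild `data`.
    """
    store = {}
    recipe = []
    for start in range(0, len(data), block_size):
        block = data[start:start + block_size]
        fingerprint = hashlib.sha256(block).hexdigest()
        store.setdefault(fingerprint, block)  # keep only the first copy
        recipe.append(fingerprint)
    return store, recipe


def rebuild(store, recipe):
    """Reassemble the original data from the stored unique blocks."""
    return b"".join(store[fp] for fp in recipe)
```

For input made of two identical 4096-byte blocks followed by a different one, the recipe has three entries but the store holds only two blocks.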


Fido: Fast Inter-Virtual-Machine Communication for Enterprise Appliances

A. Burtsev, K. Srinivasan, P. Radhakrishnan, L. N. Bairavasundaram, K. Voruganti, and G. R. Goodson

Fido is a high-performance inter-VM communication mechanism for VMs on an enterprise appliance.

Enterprise-class server appliances such as network-attached storage systems or network routers can benefit greatly from virtualization technologies. However, current inter-VM communication techniques have significant performance overheads when employed between highly collaborative appliance components, thereby limiting the use of virtualization in such systems. We present Fido, an inter-VM communication mechanism that leverages the inherent relaxed trust model between the software components in an appliance to achieve high performance. We have also developed common device abstractions on top of Fido: a network device (MMNet) and a block device (MMBlk).

We evaluate MMNet and MMBlk using microbenchmarks and find that they outperform existing alternative mechanisms. As a case study, we have implemented a virtualized architecture for a network-attached storage system incorporating Fido, MMNet, and MMBlk. We use both microbenchmarks and TPC-C to evaluate our virtualized storage system architecture. In comparison to a monolithic architecture, the virtualized one exhibits nearly no performance penalty in our benchmarks, thus demonstrating the viability of virtualized enterprise server architectures that use Fido.

In Proceedings of the USENIX Annual Technical Conference 2009 (USENIX ’09)


  • A copy of the paper is attached to this posting.


CLUEBOX: A Performance Log Analyzer for Automated Troubleshooting

S. Ratna Sandeep, M. Swapna, Thirumale Niranjan, Sai Susarla, and Siddhartha Nandi

This paper discusses CLUEBOX, a non-intrusive toolkit that aids rapid problem diagnosis by characterizing workloads, predicting performance, and discovering anomalous behavior.

Performance problems in complex systems are often caused by underprovisioning, workload interference, incorrect expectations, or bugs. Troubleshooting such systems is a difficult task faced by service engineers. We have built CLUEBOX, a non-intrusive toolkit that aids rapid problem diagnosis. It employs machine learning techniques on the available performance logs to characterize workloads, predict performance, and discover anomalous behavior. By identifying the most relevant anomalies to focus on, CLUEBOX automates the most onerous aspects of performance troubleshooting. We have experimentally validated our methodology in a networked storage environment with real workloads. Using CLUEBOX to learn from a set of historical performance observations, we were able to distill over 2000 performance counters into 68 counters that succinctly describe a running workload. Further, we demonstrate effective troubleshooting of two scenarios that adversely impacted application response time: (1) an unknown competing workload, and (2) a file system consistency checker. By reducing the set of anomalous counters to examine to a dozen significant ones, CLUEBOX was able to guide a systems engineer rapidly towards the correct root cause.

In Proceedings of the USENIX Workshop on Analysis of System Logs 2008 (WASL ’08)


  • A copy of the paper is attached to this posting.
  • The presentation slides from the workshop are also attached to this posting.