All posts by xing

Page Cache Attacks

Daniel Gruss, Erik Kraft, Graz University of Technology; and Trishita Tiwari, Boston University; Michael Schwarz, Graz University of Technology; Ari Trachtenberg, Boston University; Jason Hennessey, NetApp; and Alex Ionescu, CrowdStrike and Anders Fogh, Intel

Proceedings of the 2019 ACM SIGSAC Conference on Computer and Communications Security
November 11 – 15, 2019
London, United Kingdom

We present a new side-channel attack that targets one of the most fundamental software caches in modern computer systems: the operating system page cache. The page cache is a pure software cache that contains all disk-backed pages, including program binaries, shared libraries, and other files. On Windows, dynamic pages are also part of this cache and can be attacked as well, e.g., data, heap, and stacks. Our side channel permits unprivileged monitoring of accesses to these pages of other processes, with a spatial resolution of 4kB and a temporal resolution of 2µs on Linux (≤6.7 measurements per second), and 466ns on Windows 10 (≤223 measurements per second). We systematically analyze the side channel by demonstrating different hardware-agnostic local attacks, including a sandbox-bypassing high-speed covert channel, an ASLR break on Windows 10, and various information leakages that can be used for targeted extortion, spam campaigns, and more directly for UI redressing attacks. We also show that, as with hardware cache attacks, we can attack the generation of temporary passwords on vulnerable cryptographic implementations. Our hardware-agnostic attacks can be mitigated with our proposed security patches, but the basic side channel remains exploitable via timing measurements. We demonstrate this with a remote covert channel exfiltrating information from a colluding process through innocuous server requests.


  • A copy of the paper can be found at: link.

On the Universally Composable Security of OpenStack

Hoda Maleki (University of Connecticut); Kyle Hogan (MIT); Reza Rahaeimehr (University of Connecticut); Ran Canetti, Mayank Varia, Jason Hennessey (Boston University and NetApp); Marten van Dijk (University of Connecticut); Haibin Zhang (UMBC)

IEEE Secure Development Conference
September 25 – 27, 2019
McLean, VA

We initiate an effort to provide a rigorous, holisticand modular security analysis of OpenStack. OpenStack is theprevalent open-source, non-proprietary package for managingcloud services and data centers. It is highly complex and consistsof multiple inter-related components which are developed byseparate, loosely coordinated groups. All of these properties makethe security analysis of OpenStack both a worthy mission and achallenging one. We base our modeling and security analysis inthe universally composable (UC) security framework. This allowsspecifying and proving security in a modular way — a crucialfeature when analyzing systems of such magnitude. Our analysishas the following key features:
1) It isuser-centric: It stresses the security guarantees givento users of the system in terms of privacy, correctness, andtimeliness of the services.
2) It considers the security of OpenStack even when some ofthe components are compromised. This departs from thetraditional design approach of OpenStack, which assumesthat all services are fully trusted.
3) It ismodular: It formulates security properties for individualcomponents and uses them to prove security properties ofthe overall system.

Specifically, this work concentrates on the high-level struc-ture of OpenStack, leaving the further formalization and moredetailed analysis of specific OpenStack services to future work.Specifically, we formulate ideal functionalities that correspond tosome of the core OpenStack modules, and then proves securityof the overall OpenStack protocol given the ideal components.

As demonstrated within, the main challenge in the high-level design is to provide adequately fine-grained scoping ofpermissions to access dynamically changing system resources.We demonstrate security issues with current mechanisms in caseof failure of some components, propose alternative mechanisms,and rigorously prove adequacy of then new mechanisms within ourmodeling.


  • A copy of the paper can be found at: link.

FlexGroup Volumes: A Distributed WAFL File System

Ram Kesavan, Google; Jason Hennessey, Richard Jernigan, Peter Macko, Keith A. Smith, Daniel Tennant, and Bharadwaj V. R., NetApp

2019 USENIX Annual Technical Conference

The rapid growth of customer applications and datasets has led to demand for storage that can scale with the needs of modern workloads. We have developed FlexGroup volumes to meet this need. FlexGroups combine local WAFL® file systems in a distributed storage cluster to provide a single namespace that seamlessly scales across the aggregate resources of the cluster (CPU, storage, etc.) while preserving the features and robustness of the WAFL file system.

In this paper we present the FlexGroup design, which includes a new remote access layer that supports distributed transactions and the novel heuristics used to balance load and capacity across a storage cluster. We evaluate FlexGroup performance and efficacy through lab tests and field data from over 1,000 customer FlexGroups.


Managing Response Time Tails by Sharding

P. G. Harrison, Imperial College London; N. M. Patel, NetApp Inc; J. F. Pérez, Universidad del Rosario; Z. Qiu, Imperial College London

ACM Transactions on Modeling and Performance Evaluation of Computing Systems (TOMPECS)
Volume 4 Issue 1, March 2019
Article No. 5

Matrix analytic methods are developed to compute the probability distribution of response times (i.e., data access times) in distributed storage systems protected by erasure coding, which is implemented by sharding a data object into N fragments, only K<; N of which are required to reconstruct the object. This leads to a partial-fork-join model with a choice of canceling policies for the redundant N−K tasks. The accuracy of the analytical model is supported by tests against simulation in a broad range of setups. At increasing workload intensities, numerical results show the extent to which increasing the redundancy level reduces the mean response time of storage reads and significantly flattens the tail of their distribution; this is demonstrated at medium-high quantiles, up to the 99th. The quantitative reduction in response time achieved by two policies for canceling redundant tasks is also shown: for cancel-at-finish and cancel-at-start, which limits the additional load introduced whilst losing the benefit of selectivity amongst fragment service times.


Storage Gardening: Using a Virtualization Layer for Efficient Defragmentation in the WAFL File System

Ram Kesavan, Matthew Curtis-Maury, Vinay Devadas, and Kesari Mishra, NetApp

7th USENIX Conference on File and Storage Technologies (FAST)
FEBRUARY 25–28, 2019

As a file system ages, it can experience multiple forms of fragmentation. Fragmentation of the free space in the file system can lower write performance and subsequent read performance. Client operations as well as internal operations, such as deduplication, can fragment the layout of an individual file, which also impacts file read performance. File systems that allow sub-block granular addressing can gather intra-block fragmentation, which leads to wasted free space. This paper describes how the NetApp® WAFL® file system leverages a storage virtualization layer for defragmentation techniques that physically relocate blocks efficiently, including those in read-only snapshots. The paper analyzes the effectiveness of these techniques at reducing fragmentation and improving overall performance across various storage media.


TDDFS: A Tier-Aware Data Deduplication-Based File System

Zhichao Cao, Hao Wen, University of Minnesota; Xiongzi Ge, NetApp; Jingwei Ma, Nankai University; Jim Diehl, David H. C. Du; University of Minnesota

ACM Transactions on Storage (TOS)
Volume 15 Issue 1, March 2019

With the rapid increase in the amount of data produced and the development of new types of storage devices, storage tiering continues to be a popular way to achieve a good tradeoff between performance and cost-effectiveness. In a basic two-tier storage system, a storage tier with higher performance and typically higher cost (the fast tier) is used to store frequently-accessed (active) data while a large amount of less-active data are stored in the lower-performance and low-cost tier (the slow tier). Data are migrated between these two tiers according to their activity. In this article, we propose a Tier-aware Data Deduplication-based File System, called TDDFS, which can operate efficiently on top of a two-tier storage environment.

Specifically, to achieve better performance, nearly all file operations are performed in the fast tier. To achieve higher cost-effectiveness, files are migrated from the fast tier to the slow tier if they are no longer active, and this migration is done with data deduplication. The distinctiveness of our design is that it maintains the non-redundant (unique) chunks produced by data deduplication in both tiers if possible. When a file is reloaded (called a reloaded file) from the slow tier to the fast tier, if some data chunks of the file already exist in the fast tier, then the data migration of these chunks from the slow tier can be avoided. Our evaluation shows that TDDFS achieves close to the best overall performance among various file-tiering designs for two-tier storage systems.


Yodea: Workload Pattern Assessment Tool for Cloud Migration

Rukma Talwadker and Cijo George, NetApp

2018 IEEE International Conference on Cloud Computing Technology and Science (CloudCom)
10-13 Dec. 2018

As the news around cloud repatriations gets real, many cloud technologists associate them with poor understanding of the applications and their usage patterns by the enterprises. Our solution, Yodea, is a tool cum methodology to analyze work-load patterns in the light of cloud suitability. We bring forward compute patterns which can benefit from cloud economics with on-demand compute scaling. Yodea further ranks workloads in terms of their cloud suitability on the basis of these metrics. After the fact analysis of storage workloads for a customer install-base, features 38% of the “already in cloud” volumes in the top 100 ranked list by Yodea.


Fail-Slow at Scale: Evidence of Hardware Performance Faults in Large Production Systems

Haryadi S. Gunawi and Riza O. Suminto, University of Chicago; Russell Sears and Casey Golliher, Pure Storage; Swaminathan Sundararaman, Parallel Machines; Xing Lin and Tim Emami, NetApp; Weiguang Sheng and Nematollah Bidokhti, Huawei; Caitie McCaffrey, Twitter; Gary Grider and Parks M. Fields, Los Alamos National Laboratory; Kevin Harms and Robert B. Ross, Argonne National Laboratory; Andree Jacobson, New Mexico Consortium; Robert Ricci and Kirk Webb, University of Utah; Peter Alvaro, University of California, Santa Cruz, Mingzhe Hao, Huaicheng Li, and H. Birali Runesha, University of Chicago

ACM Transactions on Storage (TOS) TOS
Volume 14 Issue 3, October 2018
Article No. 23

Fail-slow hardware is an under-studied failure mode. We present a study of 114 reports of fail-slow hardware incidents, collected from large-scale cluster deployments in 14 institutions. We show that all hardware types such as disk, SSD, CPU, memory, and network components can exhibit performance faults. We made several important observations such as faults convert from one form to another, the cascading root causes and impacts can be long, and fail-slow faults can have varying symptoms. From this study, we make suggestions to vendors, operators, and systems designers.


ChewAnalyzer: Workload-Aware Data Management Across Differentiated Storage Pools

Xiongzi Ge, NetApp, Inc.; Xuchao Xie, NUDT; David H.C. Du, University of Minnesota; Pradeep Ganesan, NetApp, Inc.; Dennis Hahn, NetApp, Inc.;

the 26th IEEE International Symposium on the Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS 2018)

September 25 – 28, 2018
Milwaukee, Wisconsin, US

In multi-tier storage systems, moving data from one tier to the next can be inefficient. And because each type of storage device has its own idiosyncrasies with respect to the workloads that it can best support, unnecessary data movement might result. In this paper, we explore a fully connected storage architecture in which data can move from any storage pool to another. We propose a Chunk-level storage-aware workload Analyzer framework, abbreviated as ChewAnalyzer, to facilitate efficient data placement. Access patterns are characterized in a flexible way by a collection of I/O accesses to a data chunk. ChewAnalyzer employs a Hierarchical Classifier [30] to analyze the chunk patterns step by step. In each classification step, the Chunk Placement Recommender suggests new data placement policies according to the device properties. Based on the analysis of access pattern changes, the Storage Manager can adequately distribute or migrate the data chunks across different storage pools. Our experimental results show that ChewAnalyzer improves the initial data placement and that it migrates data into the proper pools directly and efficiently.


A model-based approach to streamlining distributed training for asynchronous SGD

Sung-Han Lin, NetApp; Marco Paolieri, University of Southern California; Cheng-Fu Chou, National Taiwan University; Leana Golubchik, University of Southern California

the 26th IEEE International Symposium on the Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS 2018)

September 25 – 28, 2018
Milwaukee, Wisconsin, US

The success of Deep Neural Networks (DNNs) has created significant interest in the development of software tools,
hardware architectures, and cloud systems to meet the huge computational demand of their training jobs. A common approach to speeding up an individual job is to distribute training data and computation among multiple nodes, periodically exchanging intermediate results. In this paper, we address two important problems for the application of this strategy to large-scale clusters and multiple, heterogeneous jobs. First, we propose and validate a queueing model to estimate the throughput of a training job as a function of the number of nodes assigned to the job; this model targets asynchronous Stochastic Gradient Descent (SGD), a popular strategy for distributed training, and requires only data from quick, two-node profiling in addition to job characteristics (number of requested training epochs, mini-batch size, size of DNN parameters, assigned bandwidth). Throughput estimations are then used to explore several classes of scheduling heuristics to reduce response time in a scenario where heterogeneous jobs are continuously submitted to a large-scale cluster. These scheduling algorithms dynamically select which jobs to run and how many nodes to assign to each job, based on different trade-offs between service time reduction and efficiency (e.g., speedup per additional node). Heuristics are evaluated through extensive simulations of realistic DNN workloads, also investigating the effects of early termination, a common scenario for DNN training jobs.