The architecture of a data system is defined by the chosen data layouts, data access algorithms, data models and data flow methods. Different applications and hardware require a different architecture design for optimal performance (speed, energy). Yet, so far all data systems are static, operating within a single and narrowly defined design space (NoSQL, NewSQL, SQL) and hardware profile. Historically, a new data system architecture requires at least a decade to reach a stable design. However, hardware and applications evolve rapidly and continuously, leaving data-driven applications locked with sub-optimal systems or with systems that simply do not have the desired functionality or the right data model. Our goal in this project is to make it extremely easy to design and test a new data system in a matter of a few hours or days as opposed to several years. Given a data set, a query workload and a hardware profile, a self-designing data system evolves such that its architecture matches the properties of the environment. The whole system design is generated automatically and adaptively by being able to create numerous individual system components that can be combined to synthesize alternative full system designs. A self-designing system continuously performs automatic synthesis of components to evaluate new designs at the lowest levels of database architectures such the data layout, access methods and execution strategies. This research creates opportunities to bootstrap new applications, to automatically create systems tailored for specific scenarios, to minimize system footprint as well as to automatically adapt to new hardware.
32nd International Conference on Massive Storage Systems and Technology
Santa Clara, CA
Object-based cloud storage has been widely adopted for their agility in deploying storage with a very low up-front cost. However, enterprises currently use them to store secondary data and not for expensive primary data. The driving reason is performance; most enterprises conclude that storing primary data in the cloud will not deliver the performance needed to serve typical workloads. Our analysis of real-world traces shows that certain primary data sets can reside in the cloud with its working set cached locally, using a cloud gateway that acts as a caching bridge between local data centers and the cloud. We use a realistic cloud gateway simulator to study the performance and cost of moving such workloads to different cloud backends (like Amazon S3). We find that when equipped with the right techniques, cloud gateways can provide competitive performance and price compared to on-premise storage. We also provide insights on how to build such cloud gateways, especially with respect to caching and prefetching techniques.
- The definitive version of the paper can be found at:
In computer systems, caches create the ability to simply and effectively boost the performance of any downstream layer, either storage or memory. Literature is rife with a variety of cache replacement algorithms that have optimized cache utilization irrespective of the workload. Unfortunately, conventional cache replacement algorithms have been designed for datapathcaches, wherein each cache miss leads to a cache insertion operation and in most cases, a cache eviction operation as well. These cache updating operations are expensive, often unnecessary, and in many cases counter-productive to cache performance. Non-datapath caches, on the other hand, are not required to perform a cache update on every cache miss. Thus, one can apply opportunistic cache updates, whereby case-by-case decisions can be made whether to perform a cache update. A host-side flash cache is an example of a non-datapath cache. Host-side flash caches are attractive because they can reduce the demands placed on network storage, speed up I/O performance, and provide I/O latency and throughput control.
In initial work, we demonstrated that the state-of-the-art external (or second-level) caching algorithm, ARC, had significant shortcomings which compromise both the cache hit-rate as well as the cache update rate. When used with a non-datapath cache, such as a host-side flash cache, this results in not just performance loss but also compromised device lifetime. We then demonstrated using production enterprise workloads that selectively activating the ARC replacement mechanisms based on workload states can offer significant benefit in terms of both cache hit-rate and update-rate.
In this project, we will investigate and develop an entirely new class of cache replacement algorithms that are tailored specifically for non-datapath caches. To this end, our focus will be on developing and evaluating algorithms that are aware of workload states and that can intelligently and selectively perform cache updates such as to benefit both cache hit-rate and cache update-rate, and consequently, cache device lifetime. Our past work was limited to modifying ARC and more significantly, the algorithm we evaluated used a set of statically chosen parameters for enabling the selective caching mechanisms we built. In this project, we will build upon this initial work to make selective caching independent of the underlying caching algorithm and evaluate it for a variety of conventional (e.g., LRU, CLOCK) as well as newer (e.g., LIRS, MQ) cache replacement algorithms. Further, we will develop adaptive versions of the selective caching algorithm so that workload-based parameter configuration is no longer necessary.
ACM Transactions on Storage (TOS)
Volume 12 Issue 1, January 2016
File-system snapshots have been a key component of enterprise storage management since their inception. Creating and managing them efficiently, while maintaining flexibility and low overhead, has been a constant struggle. Although the current state-of-the-art mechanism—hierarchical reference counting—performs reasonably well for traditional small-file workloads, these workloads are increasingly vanishing from the enterprise data center, replaced instead with virtual machine and database workloads. These workloads center around a few very large files, violating the assumptions that allow hierarchical reference counting to operate efficiently. To better cope with these workloads, we introduce Generational Chain Trees (GCTrees), a novel method of space management that uses concepts of block lineage across snapshots rather than explicit reference counting. As a proof of concept, we create a prototype file system—gcext4, a modified version of ext4 that uses GCTrees as a basis for snapshots and copy-on-write. In evaluating this prototype empirically, we find that although they have a somewhat higher overhead for traditional workloads, GCTrees have dramatically lower overhead than hierarchical reference counting for large-file workloads, improving by a factor of 34 or more in some cases. Furthermore, gcext4 performs comparably to ext4 across all workloads, showing that GCTrees impose minor cost for their benefits.
- The definitive version of the paper can be found at:
Due to lack of generic, accurate, dynamic and comprehensive models for performance estimation, customers typically tend to under- provision or over-provision the storage systems today. With multi-tenancy, virtualization, scale and unified storage becoming the norm in the industry, it is highly desirable to strike an optimum balance for utilization and performance. However, performance prediction for enterprise storage systems is a tricky problem, consider- ing that there are multiple hardware and soft- ware layers cascaded in a complex way that affect the behavior of the system. Configuration factors such as CPU, cache size, RAM size, capacity, storage backend (HDD/Flash) and net- work cards etc. are known to have a significant effect on number of IOPS that can be pushed to the system. However, apart from system characteristics as these, storage workloads vary reason- ably and therefore IOPs numbers depend heavily on the type of workloads provisioned on storage systems. In this work, we treat storage system as a black-box and propose a solution that will equip admin make provisioning decisions based on knowledge of workloads.
- The definitive version of the paper can be found at:
Reinventing RDMA with Remote Direct Function Access (RDFA)
The advent of fast non-volatile, main memory technologies (e.g., PCM, RRAM, or STTM) leads to new trade-offs in designing storage systems and the network protocols used to access them. Conventional application-level network protocols (e.g., the set of RPCs that constitute the NFS interface) are too slow fully exploit the low-latency these new memories offer. RDMA is a possible alternative, but its limited functionality (just read and write) means that many operations require multiple RDMA requests, each of which requires a network round trip, forfeiting any efficiency gains.
Steven Swanson proposes to build a system called Remote Direct Function Access (RDFA) that will increase the flexibility of RDMA-style requests and allow more complex protocols to operate without the overhead of a conventional network stack. RDFA relies on a specialized network card that provide programmers with a flexible programming environment to implement application-specific operations that remote nodes can invoke to carry out complex operations (e.g.,inserts or lookups in a hash table). Steven Swanson will prototype the specialized NICs in FPGAs and evaluate the system in a cluster of 32 machines.
We introduce Falcon codes, a class of authenticated error correcting codes that are based on LT codes and achieve the following properties, for the first time simultaneously: (1) with high probability, they can correct adversarial corruptions of an encoded message, and (2) they allow very efficient encoding and decoding times, even linear in the message length. Our design framework encompasses a large number of such coding schemes. Through judicious use of simple cryptographic tools at the core LT-coding level, Falcon codes lend
themselves to secure extensions of any LT-based fountain code, in particular providing Raptor codes that achieve resilience to adversarial corruptions while maintaining their fast encoding/decoding times. Falcon codes also come in three variants, each offering different performance trade-offs. For instance, one variant works well with small input messages (100s of KB to 10s of MB), but two other variants are designed to handle much larger messages (several GB).
We study Falcon codes in a novel adversarial model for rateless codes over computational (corrupting) channels and prove their security under standard assumptions. We analyze the performance of our new coding schemes through a prototype implementation of their Raptor code extension and a thorough experimental study that demonstrates their high efficiency in practice.
Applied to data transmission, Falcon codes can provably protect Raptor codes against targeted-erasure attacks, which were recently shown by Lopes and Neves [Oakland, 2014] to cause decoding failures of RaptorQ—the most advanced, standardized (IETF RFC6330) rateless code used in practice. Applied to data storage, Falcon codes can provide significant efficiency gainings as drop-in replacements of Reed-Solomon codes; in particular, a 35% speed-up over the state-of-the-art PoR scheme by Shi et al.
- The definitive version of the paper can be found at: http://delivery.acm.org/10.1145/2820000/2813728/p1032-juels.pdf.
HotPower 2015 Workshop
Ramana Reddy, Atish Kathpal, Jayanta Basak, and Randy Katz
Legacy archival workloads have a typical write-once-read-never pattern, which fits well for tape based archival systems. With the emergence of newer applications like Facebook, Yahoo! Flickr, Apple iTunes, demand for a new class of archives has risen, where archived data continues to get accessed, albeit at lesser frequency and relaxed latency requirements. We call these types of archival storage systems as active archives. However, keeping archived data on always spinning storage media to fulfill occasional read requests is not practical due to significant power costs. Using spin-down disks, having better latency characteristics as compared to tapes, for active archives can save significant power. In this paper, we present a two-tier architecture for active archives comprising of online and offline disks, and provide an access-aware intelligent data layout mechanism to bring power efficiency. We validate the proposed mechanism with real-world archival traces. Our results indicate that the proposed clustering and optimized data layout algorithms save up to 78% power over random placement.
- The definitive version of the paper can be found at: http://delivery.acm.org/10.1145/2820000/2818742/p16-reddy.pdf.
Configuration problems are not only prevalent, but also severely impair the reliability of today’s system software. One fundamental reason is the ever-increasing complexity of configuration, reflected by the large number of configuration parameters (“knobs”). With hundreds of knobs, configuring system software to ensure high reliability and performance becomes a daunting, error-prone task.This paper makes a first step in understanding a fundamental question of configuration design: “do users really need so many knobs?”
- The definitive version of the paper can be found at: http://cseweb.ucsd.edu/~tixu/papers/fse15.pdf.
Network and Storage Stack Specialization for Performance
Over the last two years, Ilias Marinos, a doctoral student at the University of Cambridge, in collaboration with Mark Handley (UCL) and Robert Watson, have been pursuing a research project on clean-slate network-stack design and network-stack specialization, testing two hypotheses: (1) that current network-stack designs, dating from the 1980s, fail to exploit contemporary architectural features for performance and hence suffer significant penalties – and that re-architecting fundamental aspects of stack design with micro-architectural awareness will dramatically improve performance; and (2) that the generality in current network-stack designs, substantially hampers application performance whereas ‘specialized’ stacks that integrate applications with the network stack itself can offer dramatic performance improvement opportunities. The team has prototyped a clean-slate, userspace TCP stack, published at SIGCOMM 2014, illustrating these effects on high-performance network traffic for in-DRAM workloads, experiencing substantial performance benefits (e.g., 6x throughput with a tiny fraction of CPU utilization) for HTTP and DNS workloads. The team proposes to extend this work to include a clean-slate network-storage stack based on a new userspace ‘diskmap’ facility for PCI-attached flash to cater to workloads with footprints greater than DRAM size – e.g., high-volume HTTP/HTTPS content-delivery networks (CDNs), and RPC-based filesystem services. Where sensible and appropriate, the team propose to open source (and upstream) artifacts developed during this work under a BSD-style license, as well as pursue collaboration opportunities with NetApp.