Provisioning and managing SSD/NVM-based disk caching in derivative clouds
Flash-based non-volatile storage devices (SSDs) have been widely used for disk caching in data-center infrastructure setups. With virtualization-enabled hosting, SSD cache partitioning and provisioning are important management questions for resource controllers that must ensure disk I/O performance guarantees. While this problem has received considerable attention in the server-side cache management domain, the following gaps remain, and we aim to explore them:
Inclusive vs. exclusive disk caching
SSD/NVM-based disk caching for containers
SSD/NVM-based caching for nested containers in derivative cloud setups
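As a point of reference for the provisioning question above, here is a minimal sketch of the static-partitioning baseline that provisioning policies would improve on: each tenant (VM or container) receives a fixed share of the SSD cache and evicts only within its own partition. The tenant names, share policy, and LRU choice are illustrative assumptions, not part of the proposal.

```python
from collections import OrderedDict

class PartitionedCache:
    """Toy per-tenant SSD cache partitioning: each tenant gets a fixed share
    of the cache and evicts only within its own partition, so one tenant's
    traffic cannot evict another's blocks."""
    def __init__(self, total_blocks, shares):
        # shares: tenant -> fraction of total cache capacity
        self.caps = {t: max(1, int(total_blocks * f)) for t, f in shares.items()}
        self.parts = {t: OrderedDict() for t in shares}

    def access(self, tenant, block):
        """Return True on a cache hit, False on a miss (block is then admitted)."""
        part = self.parts[tenant]
        if block in part:
            part.move_to_end(block)          # refresh LRU position
            return True
        if len(part) >= self.caps[tenant]:
            part.popitem(last=False)         # evict this tenant's own LRU block
        part[block] = True
        return False
```

A provisioning controller would adjust the `shares` vector over time (e.g., from per-tenant miss-rate curves) rather than fixing it up front.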
Prioritizing Attention in High-Volume, High-Dimensional Data Streams
The rise of Big Data infrastructure has supercharged the collection of high volume, highly heterogeneous data. However, increasingly, this data is too big for any cost-effective manual inspection, and, in practice, much of this “fast data” is only accessed in exceptional cases (e.g., to debug a failure). As a result, important behaviors often go unnoticed, leading to inefficiency, wasted resources, and limited visibility into complex application deployments.
In response, this research focuses on developing infrastructure for prioritizing human attention in this fast data, automatically providing users with high-level interpretable summaries and explanations of key behaviors in data streams. Early results with such a fast data prototype, MacroBase, have proven extremely promising at small scale: by combining statistical classification and explanation, MacroBase has discovered a range of previously unknown behaviors in mobile applications, online services, and manufacturing. This allows users to do more with the data they store, and to also provide insights about deployments of complex software such as storage systems.
However, scalability along two dimensions remains particularly challenging. First, data volumes continue to rise, at much faster rates than processor speeds. Second, data is collected from an increasing number of sources, requiring data fusion and multi-source processing. In response, this research proposes to adapt methods from conventional query processing to this new domain of fast data analysis: approximate query processing techniques for sample-based stochastic stream processing (maintaining accuracy while running on a small subset of the stream) and multi-query processing to analyze multiple sources at once, both of which may have substantial impact on the design and implementation of storage software. The plan is to integrate and evaluate these techniques within MacroBase on real-world IoT sensor, time-series, and telemetry data, as well as any available telemetry from NetApp storage systems.
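One standard building block behind sample-based stream processing is reservoir sampling, which maintains a uniform sample over an unbounded stream in constant memory; downstream classification can then run on the sample rather than the full stream. The sketch below is the textbook algorithm, not MacroBase's actual implementation.

```python
import random

def reservoir_sample(stream, k, rng=None):
    """Vitter's Algorithm R: keep a uniform random sample of k items from a
    stream of unknown length, using O(k) memory."""
    rng = rng or random.Random(0)
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)          # fill the reservoir first
        else:
            j = rng.randint(0, i)        # uniform over all positions seen so far
            if j < k:
                sample[j] = item         # replace with decreasing probability
    return sample
```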
Hybrid clouds are increasingly being deployed and enable seamless data movement between public and private environments. However, when data is stored in a public cloud, significant challenges arise in making encryption techniques work together with search, deduplication, and compression. The proposed project has two main components: (1) deduplication of encrypted data and (2) searching compressed and encrypted data. Concerning deduplication, they propose a new technique for deduplicating encrypted data that is based on locality-sensitive hashing and tolerates small changes in the underlying plaintext data without excessive space overhead. Concerning search, they propose using new compression techniques inspired by the database community to perform searches on encrypted data much more efficiently than existing systems. They expect their research to improve the state of the art in both theory and systems, and even to lead to new approaches and algorithms for performing more efficient queries on deduplicated and compressed unencrypted data.
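The locality-sensitive-hashing idea behind the deduplication component can be illustrated with MinHash over content chunks: similar chunk sets yield signatures that agree in most positions, so near-duplicates are detectable by comparing short signatures. This sketch operates on plaintext chunks only; the proposed scheme would combine such hashing with encryption, which is not shown here.

```python
import hashlib

def minhash_signature(chunks, num_hashes=64):
    """MinHash: for each of num_hashes seeded hash functions, keep the minimum
    hash over all chunks. The fraction of matching positions between two
    signatures estimates the Jaccard similarity of the two chunk sets."""
    sig = []
    for seed in range(num_hashes):
        prefix = seed.to_bytes(4, "big")
        sig.append(min(hashlib.sha256(prefix + c).digest() for c in chunks))
    return sig

def estimated_similarity(sig_a, sig_b):
    """Fraction of agreeing positions, an unbiased Jaccard estimate."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```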
Maintaining locality in a space-constrained file system
Donald Porter (UNC), Michael Bender and Rob Johnson (Stony Brook), and Martin Farach-Colton (Rutgers) have teamed up on a collaborative research project entitled “Maintaining locality in a space-constrained file system.”
It is increasingly common for file systems to run out of space, especially on mobile phones, tablets, and laptops. Under space pressure, common file system data placement heuristics abandon any attempt to preserve locality, and do little, if anything, to recover data locality once space pressure is relieved (e.g., by deleting data or offloading it to the cloud).
Donald Porter and his team propose developing a variant of the packed memory array (PMA) data structure, suitable for use in a local file system, that can preserve locality at an acceptable cost, make reasonable trade-offs under extreme space pressure, and recover locality after space pressure is relieved. This work builds on the success of the BetrFS team in bridging the gap between data structure theory and file system practice, and includes both work to adapt this data structure to the constraints of a file system and file system design work to leverage PMAs. In this project, the team will evaluate the technique both in a simple ext2-style file system and within their ongoing BetrFS write-optimized file system project. If successful, this work will improve file system performance not just under ideal circumstances, but also under adverse, yet increasingly common, conditions.
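For readers unfamiliar with the data structure: a packed memory array keeps sorted items in an array with deliberate gaps, so most inserts shift only a few slots, and re-spreads items when a region gets too dense. The sketch below is a drastically simplified single-level toy (a real PMA rebalances within density-bounded windows and probes in both directions); it is meant only to convey the gapped-layout idea, not the team's design.

```python
import bisect

class SimplePMA:
    """Toy packed memory array: sorted keys in an array with gaps (None).
    Inserts shift elements only up to the nearest gap; when no gap is found,
    capacity doubles and elements are re-spread evenly, restoring gaps."""
    def __init__(self, capacity=8):
        self.slots = [None] * capacity

    def elems(self):
        return [x for x in self.slots if x is not None]

    def _respread(self, elems, capacity):
        self.slots = [None] * capacity
        step = capacity / max(1, len(elems))     # step >= 1, so targets are distinct
        for i, e in enumerate(elems):
            self.slots[int(i * step)] = e

    def insert(self, key):
        elems = self.elems()
        pos = bisect.bisect_left(elems, key)     # rank of key among elements
        occupied = [i for i, x in enumerate(self.slots) if x is not None]
        if pos < len(occupied):
            target = occupied[pos]
        elif occupied:
            target = occupied[-1] + 1
        else:
            target = 0
        gap = None                               # probe rightward for the nearest gap
        for g in range(target, len(self.slots)):
            if self.slots[g] is None:
                gap = g
                break
        if gap is None:                          # no room: grow and re-spread
            elems.insert(pos, key)
            self._respread(elems, 2 * len(self.slots))
            return
        for i in range(gap, target, -1):         # shift only up to the gap
            self.slots[i] = self.slots[i - 1]
        self.slots[target] = key
```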
SMR drives may be incorporated into the storage stack as drive-managed devices, via host-resident block translation layers, or through the use of SMR-specific log-structured file systems (LFSs), at an engineering cost ranging from modest (drive-managed) to very large (LFS). The first generation of drive-managed SMR devices has shown significant performance deficiencies when compared to conventional drives, but at this point little is known about how well SMR can perform with better translation algorithms or tuned file systems.
This work proposes a combination of high-level (trace analysis and simulation) and low-level (in-kernel implementation and benchmarking) investigation into both translation layers and file systems, to determine how fast SMR can be on realistic workloads, and at what cost – i.e. whether good SMR performance requires a change in file system, or may be achieved via translation layers in the host or device.
The investigation builds on the base of SMR and flash translation layer research performed at Northeastern by the PI over the last eight years, uses novel software artifacts developed in the PI’s lab (NSTL, an in-kernel translation layer with reprogrammable cleaning and placement algorithms), and leverages partnerships with key Linux file system developers.
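To make the host-resident translation-layer idea concrete, here is a toy sketch (an illustration, not NSTL): SMR zones accept only sequential appends at a write pointer, so the layer redirects every logical overwrite out-of-place and keeps a map from logical block address to physical location. Cleaning, which would migrate live blocks out of zones full of stale copies, is omitted, and all sizes are arbitrary.

```python
class ToySMRTranslation:
    """Minimal host-side SMR translation layer: zones are append-only, so
    logical overwrites go out-of-place and a mapping table redirects reads
    to the newest copy. Garbage collection of stale copies is omitted."""
    def __init__(self, num_zones=4, zone_size=8):
        self.zones = [[] for _ in range(num_zones)]
        self.zone_size = zone_size
        self.mapping = {}            # lba -> (zone index, offset within zone)
        self.open_zone = 0

    def write(self, lba, data):
        if len(self.zones[self.open_zone]) >= self.zone_size:
            self.open_zone += 1      # a real layer would clean or choose a zone
        zone = self.zones[self.open_zone]
        self.mapping[lba] = (self.open_zone, len(zone))
        zone.append(data)            # append at the zone's write pointer

    def read(self, lba):
        zone_idx, offset = self.mapping[lba]
        return self.zones[zone_idx][offset]
```

The performance questions the project targets live precisely in the parts elided here: which zone to open next, when to clean, and where to place migrated blocks.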
The Speed of Non-Volatile Main Memory and Capacity of Disk: A Fast, Strongly-Consistent File System for Heterogeneous Storage Stacks
Professor Steven Swanson’s group has built NOVA, a log-structured file system for hybrid volatile/non-volatile main memories. In this proposal, they would like to extend NOVA with tiering and/or caching capabilities that allow it to combine the high performance of NVMMs with the cost-effective capacity of SSDs and hard drives. They plan to explore multiple approaches to allocating valuable NVMM to maximize performance and to evaluate these approaches on a range of critical storage applications. They will also examine how the techniques they build into NOVA can allow other storage systems (e.g., “all-flash arrays,” object stores, and large file servers) to leverage NVMM as well.
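One simple policy such a design could start from is promote-on-read with LRU demotion: a small fast tier (standing in for NVMM) fronts a large capacity tier (SSD/disk). The sketch below is an assumption-laden toy to illustrate the tiering mechanics, not NOVA's design; in particular, sending writes straight to the capacity tier is just one of the allocation choices the group intends to explore.

```python
from collections import OrderedDict

class TieredStore:
    """Toy two-tier store: a small fast ('NVMM') tier fronting a large
    capacity ('disk') tier. Reads promote pages into the fast tier; the
    coldest page is demoted when the fast tier is full."""
    def __init__(self, fast_capacity):
        self.fast = OrderedDict()    # page -> data, kept in LRU order
        self.slow = {}
        self.cap = fast_capacity

    def write(self, page, data):
        self.slow[page] = data       # writes land in the capacity tier here
        self.fast.pop(page, None)    # invalidate any stale fast-tier copy

    def read(self, page):
        if page in self.fast:
            self.fast.move_to_end(page)         # fast-tier hit
            return self.fast[page]
        data = self.slow[page]
        if len(self.fast) >= self.cap:
            self.fast.popitem(last=False)       # demote the coldest page
        self.fast[page] = data                  # promote on access
        return data
```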
Improving Archival Storage via Selective Deterministic Recomputation
Professor Jason Flinn and his students at the University of Michigan have built a prototype file system for archival data that selectively replaces file data with logs that reproduce that data. This substantially reduces the bytes written and stored for cold file data, even compared to aggressive storage efficiency mechanisms such as delta compression and chunk-based deduplication.
In this project Professor Flinn’s team will extend these results in several ways. First, they will investigate how to structure logs to maximize compression since the logs of non-determinism can themselves be deduplicated. Second, they will explore whether minimal logging can still reproduce data faithfully; it is only necessary to store one computation that generates the needed data but not the precise computation that originally generated the data. Third, they will look at how semi-determinism can improve results; by encouraging applications to behave in predictable ways, can they reduce log sizes and improve storage savings? Finally, they will evaluate their system on a wider variety of realistic workloads to examine which domains see the most benefit from this class of archival storage.
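The core trade (store a small recomputation log plus a checksum instead of the file bytes) can be sketched as follows. The generator registry and log format here are invented for illustration; the real system records the nondeterministic inputs of an actual execution rather than naming canned generators.

```python
import hashlib
import json

# Stand-ins for "a deterministic computation that reproduces the data".
# In the real system, the log captures an execution's nondeterminism instead.
GENERATORS = {
    "repeat": lambda token, n: token * n,
}

def archive(gen_name, args):
    """Store a tiny log naming the computation, plus a checksum of the data,
    instead of the data itself."""
    data = GENERATORS[gen_name](*args)
    return {
        "log": json.dumps({"gen": gen_name, "args": args}),
        "sha256": hashlib.sha256(data.encode()).hexdigest(),
    }

def restore(entry):
    """Replay the logged computation and verify it reproduces the data."""
    record = json.loads(entry["log"])
    data = GENERATORS[record["gen"]](*record["args"])
    assert hashlib.sha256(data.encode()).hexdigest() == entry["sha256"]
    return data
```

The checksum is what licenses the "minimal logging" direction above: any computation that reproduces the checksummed bytes is acceptable, not just the one that originally produced them.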
Automate Failure Diagnosis with the Flow Reconstruction Principle
When distributed systems fail in the field, identifying the root cause and pinpointing the faulty software component or machine can be extremely hard and time-consuming. This research aims to provide an end-to-end solution that automates the diagnosis of production failures in a distributed software stack using only its unstructured log output. The work has three parts. First, it aims to design a new postmortem diagnosis tool that automatically reconstructs the extensive domain knowledge of the programmers who wrote the code; it does this by relying on the “Flow Reconstruction Principle”: programmers log events such that one can reliably reconstruct the execution flow a posteriori. However, any postmortem debugging that relies on log output hinges on the efficacy of such logging. This is the focus of the second part: measuring the quality of software’s log output. Finally, they intend to use this measurement to ultimately automate software logging itself.
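The principle can be illustrated with a toy: if each log line carries a timestamp, a node name, and a request identifier, then grouping lines by request and ordering them by time recovers each request's execution flow across machines. The log format below is invented for the example; real reconstruction must also infer identifiers and causality that are only implicit in the text.

```python
import re
from collections import defaultdict

# Invented log format: "[<ts>] [<node>] req=<id> <message>"
LOG_LINE = re.compile(r"\[(?P<ts>\d+)\] \[(?P<node>\w+)\] req=(?P<req>\w+) (?P<msg>.+)")

def reconstruct_flows(lines):
    """Group unstructured log lines by request id, then order by timestamp,
    recovering each request's execution flow across nodes."""
    flows = defaultdict(list)
    for line in lines:
        m = LOG_LINE.match(line)
        if m:
            flows[m.group("req")].append(
                (int(m.group("ts")), m.group("node"), m.group("msg")))
    return {req: sorted(events) for req, events in flows.items()}
```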
Design and Implementation of Efficient Auditing Schemes for Cloud Data
The aim of this project is to make cloud data secure, available, and accessible to authorized users. To protect against disk crashes, multiple copies of data might be stored. The cloud service provider (CSP) should abide by the terms of the Service Level Agreement (SLA). However, the CSP can be untrusted and might delete or modify data. To protect against this, data auditing is required. Unlike maintaining transaction logs, however, the CSP should be able to prove to the data owner that the data is intact and can be retrieved correctly; proofs of storage are thus important. The data owner might also wish to delegate the auditing task to a third party, so it is important that the third party can perform the audit without learning the content; this is known as privacy-preserving data auditing. Most existing techniques are not practical for the dynamic case (where the client can modify data) or the multi-server model. In this project, we will aim at designing practical and provably secure privacy-preserving auditing schemes, using techniques from authenticated data structures, signature schemes, cryptographic accumulators, and secure network coding.
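A minimal proof-of-storage building block is a Merkle tree over the file's blocks: the auditor keeps only the root hash, challenges the CSP for a block, and verifies the returned block against the root using a logarithmic-size path of sibling hashes. The sketch below assumes a power-of-two block count; real auditing schemes layer randomized challenges, homomorphic tags, dynamic updates, and privacy protections on top of this kind of primitive.

```python
import hashlib

def sha(data):
    return hashlib.sha256(data).digest()

def build_tree(blocks):
    """Bottom-up Merkle tree; assumes len(blocks) is a power of two."""
    level = [sha(b) for b in blocks]
    tree = [level]
    while len(level) > 1:
        level = [sha(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        tree.append(level)
    return tree                      # tree[-1][0] is the root the auditor keeps

def prove(tree, index):
    """CSP's response to a challenge: the sibling hashes along the path."""
    path = []
    for level in tree[:-1]:
        sibling = index ^ 1
        path.append((level[sibling], sibling < index))   # (hash, is_left_sibling)
        index //= 2
    return path

def verify(root, block, path):
    """Auditor recomputes the root from the claimed block and its path."""
    node = sha(block)
    for sibling, sibling_is_left in path:
        node = sha(sibling + node) if sibling_is_left else sha(node + sibling)
    return node == root
```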
A large population of customers can be affected by a sudden slowdown or abnormal behavior of an enterprise-wide application or product. Analysts and developers of large-scale systems spend considerable time dealing with functional and performance bugs. Timely identification of significant changes in application behavior may provide early, informative warnings and subsequently prevent negative impact on the service. In this project, we aim to develop a framework that predicts sudden system anomalies in advance by analyzing log data generated by the system’s submodules, and to develop an automated warning system that keeps the system from slipping toward failure. Note that this was the first year of the project, in which a start-up grant was provided by NetApp primarily to explore the problem and produce initial results. Nevertheless, we achieved several results and developed several insights during this one-year period.
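As a baseline for the kind of early-warning signal targeted here, per-interval log event counts can be scored against a rolling window: a count more than a few standard deviations from the recent mean raises a warning before the slowdown fully develops. This is a toy detector for illustration, not the project's framework; the window size and threshold are arbitrary.

```python
import math
from collections import deque

def warning_flags(counts, window=20, threshold=3.0):
    """Flag intervals whose log-event count deviates from the rolling mean by
    more than `threshold` standard deviations (a toy early-warning baseline)."""
    history = deque(maxlen=window)
    flags = []
    for count in counts:
        if len(history) >= window // 2:          # wait for enough history
            mean = sum(history) / len(history)
            var = sum((x - mean) ** 2 for x in history) / len(history)
            std = math.sqrt(var) or 1.0          # avoid divide-by-zero
            flags.append(abs(count - mean) / std > threshold)
        else:
            flags.append(False)                  # still warming up
        history.append(count)
    return flags
```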