MixApart: Decoupled Analytics for Shared Storage Systems

Madalin Mihailescu, University of Toronto and NetApp; Gokul Soundararajan, NetApp; Cristiana Amza, University of Toronto

MixApart uses an integrated data caching and scheduling solution to allow MapReduce computations to analyze data stored on enterprise storage systems.


Distributed file systems built for data analytics and enterprise storage systems have very different functionality requirements. For this reason, enabling analytics on enterprise data commonly introduces a separate analytics storage silo. This generates additional costs and inefficiencies in data management, e.g., whenever data needs to be archived, copied, or migrated across silos. MixApart uses an integrated data caching and scheduling solution to allow MapReduce computations to analyze data stored on enterprise storage systems. The front-end caching layer enables the local storage performance required by data analytics. The shared storage back-end simplifies data management.

We evaluate MixApart using a 100-core Amazon EC2 cluster with micro-benchmarks and production workload traces. Our evaluation shows that MixApart provides i) up to 28% faster performance than the traditional ingest-then-compute workflows used in enterprise IT analytics, and ii) comparable performance to an ideal Hadoop setup without data ingest, at similar cluster sizes.

In Proceedings of the USENIX Conference on File and Storage Technologies (FAST’13), February 2013.



CAFIO: Framework for Generating Content-Aware I/O Benchmarks

Xiaosong Ma, North Carolina State University – May 2012
Storage efficiency is one of the important features considered in designing, evaluating, and deploying enterprise storage systems. In recent years, the adoption of deduplication has propagated from backup and archive systems to primary storage. With both commercial and consumer data volumes and host and server compute power growing fast, inline efficiency optimization is a promising direction for today’s intelligent storage systems. With inline efficiency optimization, tasks such as deduplication and compression are performed as part of the I/O pathway. For example, tremendous inline deduplication opportunities exist in scenarios such as virtual desktop infrastructure (VDI). In addition to the well-known benefit of saving storage space, these techniques can also significantly reduce write traffic to storage devices. With the current trend of adopting SSDs, this is a particularly important advantage, which translates into improvements in both performance and device lifetime.
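As an illustration of deduplication on the I/O path, the following minimal sketch fingerprints each incoming block and writes only previously unseen content. The class name `InlineDedupStore`, the 4 KiB block size, and the use of SHA-256 are assumptions made for this example; it is a toy model, not a description of any particular storage system, and it omits reference counting and garbage collection.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed fixed block size, in bytes


class InlineDedupStore:
    """Toy block store that deduplicates on the write path (hypothetical)."""

    def __init__(self):
        self.blocks = {}          # fingerprint -> payload (stands in for the device)
        self.lba_map = {}         # logical block address -> fingerprint
        self.logical_writes = 0   # writes issued by clients
        self.physical_writes = 0  # writes that actually reach the device

    def write(self, lba, data):
        fingerprint = hashlib.sha256(data).hexdigest()
        self.logical_writes += 1
        if fingerprint not in self.blocks:
            # Only previously unseen content is written to the device.
            self.blocks[fingerprint] = data
            self.physical_writes += 1
        self.lba_map[lba] = fingerprint

    def read(self, lba):
        return self.blocks[self.lba_map[lba]]


# VDI-style scenario: 10 virtual desktops cloned from the same 8-block image.
store = InlineDedupStore()
os_image = [bytes([i]) * BLOCK_SIZE for i in range(8)]
for vm in range(10):
    for offset, block in enumerate(os_image):
        store.write(vm * len(os_image) + offset, block)

print(store.logical_writes, store.physical_writes)  # 80 8
```

In this contrived case only 8 of the 80 logical writes reach the device, which is the write-traffic reduction that matters for SSD endurance.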

At the same time, inline storage efficiency optimization requires more careful workload-dependent cost-benefit analysis. Compared to backup or archiving, we may see a smaller data duplication ratio in primary workloads. In primary storage we also have stricter requirements on IOPS and latency. In addition, compression effectiveness and benefit depend heavily on the innate compressibility of the data content, as well as on the compression algorithm chosen. The design tradeoff is further complicated by the integration of new storage media such as SSDs and other non-volatile memory devices into the primary storage hierarchy.
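The dependence of compression benefit on data content can be seen with a small experiment. The sketch below uses Python's standard `zlib` module; the helper `compression_ratio` and the three sample buffers are assumptions chosen for illustration and are not drawn from any real workload.

```python
import os
import zlib


def compression_ratio(data, level=6):
    """Return original size divided by compressed size (higher is better)."""
    return len(data) / len(zlib.compress(data, level))


# Three illustrative buffers of roughly 64 KiB each.
samples = {
    "zeros": bytes(64 * 1024),                # trivially compressible
    "text": b"storage efficiency " * 3500,    # highly repetitive text
    "random": os.urandom(64 * 1024),          # effectively incompressible
}

for name, data in samples.items():
    for level in (1, 9):
        ratio = compression_ratio(data, level)
        print(f"{name:>6} level={level}: ratio {ratio:.2f}")
```

Incompressible content costs CPU time on the I/O path without saving any space, which is why the cost-benefit analysis is workload dependent.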

Unfortunately, there is currently a lack of representative and flexible testing tools for evaluating such storage efficiency optimization techniques. Unlike conventional storage systems research, storage efficiency studies place additional requirements on their testing workloads. For such evaluation, I/O traces should preserve data content information along with the logical block addresses and timestamp information captured in conventional I/O traces. Similarly, I/O benchmarks should access realistic data contents. Currently, research in this area typically tests content-aware optimizations using real-world workloads, either industry data or contentful benchmarks (such as TPC-like applications). With the latter, the data and storage access patterns are fixed by the specific applications. With the former, it is challenging to collect and share proprietary workload content and I/O traces, as (1) commercial or personal data content is often sensitive and cannot be directly released, (2) I/O traces fully coupled with block contents can be enormous in size and cumbersome to use, and (3) such contentful traces are in most cases inflexible and hard to manipulate to create parametric experiments.

This proposed project explores the idea of a content-aware synthetic I/O workload generator, with a focus on generating synthetic I/O content that captures the content duplication characteristics of different real-world workloads. The proposal is to develop CAFIO (Content Aware Flexible I/O), a content generation framework that addresses the drawbacks of using real-world traces for storage efficiency studies. If successful, this research will produce a flexible and portable evaluation tool that can facilitate emerging research activities in this area.

Specific research challenges to be addressed in this proposed work include (1) automatically extracting content features that describe data duplication behavior relevant to storage efficiency optimization, (2) efficiently regenerating data at trace replay or benchmark execution time, reproducing the original data duplication characteristics without revealing the original data content, and (3) enabling a flexible and extensible tool framework that can accommodate new content features and/or evaluate additional storage efficiency algorithms.
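To make challenges (1) and (2) concrete, here is one possible sketch of the extract-then-regenerate idea. It is not CAFIO's design: the function names, the choice of a per-block content ID as the extracted feature, and the seeded pseudo-random payloads are all assumptions made for illustration.

```python
import hashlib
import random

BLOCK_SIZE = 4096  # assumed fixed block size, in bytes


def extract_duplication_profile(blocks):
    """Replace each block with an anonymous content ID.

    IDs are assigned in order of first appearance, so the profile records
    which blocks are duplicates of each other but none of their content.
    """
    ids = {}
    profile = []
    for block in blocks:
        fingerprint = hashlib.sha256(block).digest()
        profile.append(ids.setdefault(fingerprint, len(ids)))
    return profile


def regenerate_blocks(profile, seed=0):
    """Yield synthetic blocks; equal content IDs give equal payloads."""
    for content_id in profile:
        rng = random.Random(f"{seed}:{content_id}")
        yield rng.getrandbits(BLOCK_SIZE * 8).to_bytes(BLOCK_SIZE, "little")


# A tiny "sensitive" trace: block A is written three times, block B once.
original = [b"A" * BLOCK_SIZE, b"B" * BLOCK_SIZE,
            b"A" * BLOCK_SIZE, b"A" * BLOCK_SIZE]

profile = extract_duplication_profile(original)
synthetic = list(regenerate_blocks(profile))

print(profile)                                          # [0, 1, 0, 0]
print(extract_duplication_profile(synthetic) == profile)  # True
```

The synthetic blocks share the duplication pattern of the original trace while carrying none of its bytes. A practical tool would likely summarize the profile further (for example, as a distribution of duplicate counts) to keep it compact and tunable, and would also need to model compressibility, which uniformly random payloads do not.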