Madalin Mihailescu, University of Toronto and NetApp; Gokul Soundararajan, NetApp; Cristiana Amza, University of Toronto
Distributed file systems built for data analytics and enterprise storage systems have very different functionality requirements. For this reason, enabling analytics on enterprise data commonly introduces a separate analytics storage silo. This generates additional costs and inefficiencies in data management, e.g., whenever data needs to be archived, copied, or migrated across silos. MixApart uses an integrated data caching and scheduling solution to allow MapReduce computations to analyze data stored on enterprise storage systems. The front-end caching layer enables the local storage performance required by data analytics. The shared storage back-end simplifies data management.
We evaluate MixApart using a 100-core Amazon EC2 cluster with micro-benchmarks and production workload traces. Our evaluation shows that MixApart provides i) up to 28% faster performance than the traditional ingest-then-compute workflows used in enterprise IT analytics, and ii) performance comparable to an ideal Hadoop setup without data ingest, at similar cluster sizes.
In Proceedings of the USENIX Conference on File and Storage Technologies (FAST’13), February 2013.
Studying the Impact of Storage System Design on Hadoop Performance
The project aims to quantify the trade-offs between using local disks versus NAS to support Hadoop workloads, and to identify bottlenecks and the scenarios under which one approach is beneficial over the other. Next, we plan to leverage these findings to investigate different storage-compute configurations. The overall goal is to study how NAS can be employed to complement Hadoop systems. Moreover, we are interested in identifying the application characteristics that make a workload a better fit for a NAS-enabled Hadoop environment versus the original Hadoop setup.
A research challenge is to capture how the integration of different storage system designs into Hadoop affects overall performance, and how the storage system can be tailored to meet the I/O demands of applications. Exploring such design decisions on real setups is ideal. Unfortunately, limited resources and very long time-to-solution render such real evaluation impractical. An alternative is to evaluate the new system design in a simulator that accurately captures the behavior of Hadoop and its various parameters and configurations. We have built such a simulator, MRPerf, which can simulate the performance of Hadoop applications when provided with an infrastructure specification and workload characteristics.
We will design, develop, implement, and evaluate different storage designs for Hadoop setups through simulation. We will study and evaluate different scenarios of storage integration with Hadoop. We will add a storage device model (based on feedback from NetApp) to the MRPerf simulator. This will include extending MRPerf and developing a new interface for a configurable storage device, with parameters such as I/O bandwidth and latency. We aim to capture critical interactions such as network contention, processor contention, and storage contention to a reasonable extent. The results of such simulations can provide significant insights about the system and enable the design of suitable real storage devices for Hadoop.
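To make the proposed interface concrete, the following is a minimal sketch of what a configurable storage device model could look like. The class name, parameter names, and the simple latency-plus-bandwidth service-time formula are illustrative assumptions, not MRPerf's actual interface.

```python
# Hypothetical sketch of a configurable storage device model for a
# MapReduce simulator. Names and the linear service-time model
# (fixed access latency + transfer time) are assumptions for illustration.
from dataclasses import dataclass


@dataclass
class StorageDevice:
    latency_s: float      # per-request access latency, in seconds
    bandwidth_bps: float  # sustained transfer bandwidth, in bytes/second

    def service_time(self, request_bytes: int) -> float:
        """Time to serve one I/O request: access latency plus transfer time."""
        return self.latency_s + request_bytes / self.bandwidth_bps


# Example: compare a local disk against a NAS back end for a 64 MB block read.
local_disk = StorageDevice(latency_s=0.005, bandwidth_bps=100e6)
nas = StorageDevice(latency_s=0.010, bandwidth_bps=400e6)

block_bytes = 64 * 1024 * 1024
print(local_disk.service_time(block_bytes))  # ~0.68 s
print(nas.service_time(block_bytes))         # ~0.18 s
```

A model like this lets the simulator swap local-disk and NAS parameterizations behind one interface; contention effects (network, processor, storage) would be layered on top rather than folded into the per-device formula.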