Tag Archives: mascots14

WORMStore: A Specialized Object Store for Write-Once Read-Many Workloads

mascots14: Srinivasan Narayanamurthy, Kartheek Muthyala and Gaurav Makkar.

The recent increase in interest in batch analytics has resulted in extensive use of distributed frameworks such as Hadoop and Dryad. Batch analytics, as the name suggests, performs many computations on large volumes of data.

That is, large quantities of data are ingested once and read many times, mostly in large chunks, which is characterized as a write-once read-many (WORM) workload. The storage layer of these distributed frameworks (say, HDFS in Hadoop) uses file systems such as ext4 or XFS as native object stores to store objects as files on individual nodes of the distributed system. These general-purpose file systems were designed with broader goals such as POSIX compliance, optimal performance for a wide range of file sizes, user friendliness, etc. However, most of these features are not required for a native object store in distributed file systems.

WORMStore is a lightweight object store designed exclusively for use in distributed systems with WORM workloads. WORMStore provides advantages such as the ability to prefetch large objects, a small metadata-to-data ratio, media-aware data/metadata placement, etc. Because WORMStore is log-structured, it can recover after a failure. Our experiments show that WORMStore provides a 28% increase in the read throughput per node in a Hadoop cluster.
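To illustrate why a log-structured layout makes recovery straightforward, here is a minimal Python sketch of an append-only, write-once object log. It is not WORMStore's actual design: the class name `TinyWormLog`, the record layout in `HEADER`, and the recovery policy (drop a torn tail record) are all assumptions made for this example.

```python
import os
import struct

# Hypothetical record layout: key length and value length as two little-endian uint32s.
HEADER = struct.Struct("<II")


class TinyWormLog:
    """Append-only object log: each object is written once and never updated."""

    def __init__(self, path):
        self.path = path
        self.index = {}  # key -> (offset of value bytes, length of value)
        open(path, "ab").close()  # create the log file if it does not exist
        self._recover()

    def _recover(self):
        # Rebuild the in-memory index by scanning the log from the start.
        # A record cut short by a crash is ignored and truncated away.
        size = os.path.getsize(self.path)
        pos = 0
        with open(self.path, "rb") as f:
            while pos + HEADER.size <= size:
                f.seek(pos)
                klen, vlen = HEADER.unpack(f.read(HEADER.size))
                end = pos + HEADER.size + klen + vlen
                if end > size:
                    break  # torn tail record
                key = f.read(klen).decode("utf-8")
                self.index[key] = (pos + HEADER.size + klen, vlen)
                pos = end
        if pos < size:
            with open(self.path, "r+b") as f:
                f.truncate(pos)

    def put(self, key, data):
        if key in self.index:
            raise KeyError(f"object {key!r} already written (write-once)")
        kb = key.encode("utf-8")
        start = os.path.getsize(self.path)  # new record begins at the current end
        with open(self.path, "ab") as f:
            f.write(HEADER.pack(len(kb), len(data)) + kb + data)
            f.flush()
            os.fsync(f.fileno())
        self.index[key] = (start + HEADER.size + len(kb), len(data))

    def get(self, key):
        offset, length = self.index[key]
        with open(self.path, "rb") as f:
            f.seek(offset)
            return f.read(length)
```

Because objects are never overwritten, the log itself is the only durable state: reopening the file and replaying its records is enough to restore the index, which is the property this sketch is meant to show.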

  • The author’s version of the paper is attached to this posting. Please observe the following copyright:

© 2012 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.

  • The definitive version of the paper can be found at:

Nakshatra: Towards running batch analytics on an archive

mascots14: Atish Kathpal and Giridhar Yasa

Long-term retention of data has become the norm for reasons such as compliance and data preservation for future needs. With storage media continuing to become cheaper, this trend has further strengthened, as evidenced by the introduction of archival solutions like Amazon Glacier and Spectra Logic BlackPearl.

On the other hand, analytics and big data have become key enablers for business and research. However, analytics and archiving happen on separate storage silos. This generates additional costs and inefficiencies when part of the archived data needs to be analyzed using batch analytics platforms like Hadoop, because a) additional storage is needed for data transferred from the archive to the analytics tier, and b) transfer time costs are incurred by migrating data to the analytics tier. Moreover, accessing archived data has a high time to first byte, as much of the data is stored on offline media like tapes or spun-down disks. We introduce Nakshatra, a data processing framework to run analytics directly on an archive based on offline media. To the best of our knowledge, this is the first work of its kind available in the literature. We leverage batched pre-fetching and scheduling techniques for improved retrieval of data and scalable analytics on archives. Our preliminary evaluation shows Nakshatra to be up to 81% faster than the traditional ingest-then-compute workflow for archived data.
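As a rough illustration of what batching retrievals from offline media can look like, the Python sketch below groups requested objects by the medium that holds them, so that each tape or disk is brought online once per batch. This is not Nakshatra's scheduler: the function `batch_by_medium`, the `location` catalog, and the largest-batch-first ordering are assumptions made for this example.

```python
from collections import defaultdict


def batch_by_medium(requests, location):
    """Group requested object names by the offline medium that holds them.

    requests: iterable of object names, in arrival order
    location: dict mapping object name -> medium id (a hypothetical catalog)

    Returns a list of (medium_id, [object names]) batches, largest batch
    first, with ties broken by medium id.
    """
    batches = defaultdict(list)
    for name in requests:
        batches[location[name]].append(name)
    return sorted(batches.items(), key=lambda kv: (-len(kv[1]), kv[0]))


# Example: three objects live on tape "t1", one on tape "t2".
catalog = {"a": "t1", "b": "t2", "c": "t1", "d": "t1"}
plan = batch_by_medium(["a", "b", "c", "d"], catalog)
# plan == [("t1", ["a", "c", "d"]), ("t2", ["b"])]
```

Serving the largest batch first is only one possible policy; a real scheduler would also have to weigh mount latency, drive availability, and the order in which analytics tasks need their inputs.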

  • The author’s version of the paper is attached to this posting. Please observe the following copyright:

© 2012 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.

  • The definitive version of the paper can be found at: