Nakshatra: Towards running batch analytics on an archive

Atish Kathpal and Giridhar Yasa (MASCOTS 2014)

Long-term retention of data has become the norm for
reasons such as compliance and data preservation for future needs.
As storage media continue to become cheaper, this trend has
strengthened further, as evidenced by the introduction of archival
solutions such as Amazon Glacier and Spectra Logic BlackPearl.

On the other hand, analytics and big data have become key enablers
for business and research. However, analytics and archiving
happen on separate storage silos. This generates additional costs
and inefficiencies when part of the archived data needs to be
analyzed using batch analytics platforms like Hadoop, because (a)
additional storage is needed for data transferred from the archive to
the analytics tier and (b) transfer time costs are incurred due to data
migration to the analytics tier. Moreover, accessing archived data
incurs a high time to first byte, as much of the data is stored on
offline media such as tapes or spun-down disks. We introduce
Nakshatra, a data processing framework to run analytics directly
on an archive based on offline media. To the best of our
knowledge, this is the first work of its kind in the literature.
We leverage batched pre-fetching and scheduling techniques for
improved retrieval of data and scalable analytics on archives.
Our preliminary evaluation shows Nakshatra to be up to 81%
faster than the traditional ingest-then-compute workflow for
archived data.

  • The author’s version of the paper is attached to this posting. Please observe the following copyright:
  • © 2012 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.

    • The definitive version of the paper can be found at: