Electronic data volumes keep growing at rapid rates, costing users precious storage space and increasing total costs of ownership (energy, performance, etc.). Data deduplication is a popular technique for reducing the amount of data that must be retained. Several vendors offer dedup-based products, and many publications describe such systems. Alas, there is a serious lack of comparable results across systems. Often the problem is a lack of realistic data sets that can be shared without violating privacy; moreover, good data sets can be very large and thus difficult to share. Many papers publish results using small or non-representative data sets (e.g., successive Linux kernel source tarballs). Lastly, there is no agreement on what constitutes a “realistic” data set.
We propose to develop tools and techniques to produce realistic, scalable dedupable data sets, taking actual workloads into account. We will begin by analyzing the dedupability properties of several data sets we have access to; we will develop and release tools that let anyone analyze their own data sets without violating privacy. Next, we will build models that describe the important inherent properties of those data sets. We will then generate synthetic data that follows these models, producing data sets far larger than their originals while faithfully modeling the original data.
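To illustrate the kind of privacy-preserving analysis described above, the sketch below splits data into chunks and records only cryptographic digests of each chunk, never the contents, so the resulting hash trace could be shared for dedup studies. The fixed 4 KiB chunk size and fixed-size chunking are illustrative assumptions only (real dedup analyses often use content-defined chunking), and this is not the project's actual tool:

```python
# Sketch: privacy-preserving deduplication analysis (illustrative only).
# Only SHA-256 digests of chunks are retained; raw content is discarded,
# so the hash trace can be shared without exposing the underlying data.
import hashlib
from collections import Counter

CHUNK_SIZE = 4096  # assumed chunk size; real systems vary


def chunk_digests(data: bytes, chunk_size: int = CHUNK_SIZE):
    """Yield SHA-256 digests of fixed-size chunks of `data`."""
    for off in range(0, len(data), chunk_size):
        yield hashlib.sha256(data[off:off + chunk_size]).hexdigest()


def dedup_ratio(datasets):
    """Ratio of logical chunks to unique chunks across all inputs.

    A ratio of 2.0 means half the stored chunks could be eliminated.
    """
    counts = Counter()
    for data in datasets:
        counts.update(chunk_digests(data))
    total = sum(counts.values())
    unique = len(counts)
    return total / unique if unique else 1.0


# Example: two buffers that share one identical 4 KiB block.
shared = b"A" * CHUNK_SIZE
r = dedup_ratio([shared + b"B" * CHUNK_SIZE,
                 shared + b"C" * CHUNK_SIZE])
```

In this example the two inputs contribute four logical chunks but only three unique ones, so the ratio is 4/3; a tool of this shape would report such ratios (and chunk-frequency distributions) as the dedupability properties to be modeled.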