Tag Archives: distributed_systems

Patrick Eugster, Purdue University – August 2014

Federated Big Data Processing

Big data processing represents one of the major challenges of this era. Cloud computing technologies offer a partial response to the computation and storage resource needs of big data processing. However, in contrast to the illusion of omnipresent, uniform storage and computation resources promoted by many cloud providers, clouds are implemented by concrete datacenters with specific locations. Big data is often geographically distributed (geo-distributed) or even partitioned across datacenters (or zones or sub-networks), and inter-datacenter access times differ strongly from intra-datacenter ones. Despite the substantial impact of communication costs on response times in big data applications (up to 50% even in single-datacenter setups), existing systems deployed in clouds work poorly across datacenters, e.g., by naïvely copying all data “to the mothership” first, if they support such setups at all. Existing work on geo-distributed clouds considers data storage but not the problem of computing complex queries on such data.

We consider a setup where big datasets are distributed or even partitioned across datacenters, and we want to execute queries across these datasets, exploiting parallelization. Rather than naïvely copying all data to one datacenter before running a query, our goal, intuitively, is to move parts of the query towards the respective data and to make smart decisions on where and when to consolidate data as the query requires. Given that inter-datacenter communication is more costly (in both time and money) than intra-datacenter communication, and that the amount of data in big data jobs commonly decreases as computation proceeds, consolidating data as late as possible is generally beneficial.
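The intuition behind late consolidation can be illustrated with a minimal sketch. It is not the system described in the talk: the datacenter names, the toy records, and the functions `naive_query` and `pushed_down_query` are all hypothetical, and a simple count of shipped records stands in for inter-datacenter communication cost. The query is a per-key sum, and the coordinator is assumed to be a separate site.

```python
from collections import Counter

# Hypothetical toy data: (key, value) records held in two datacenters.
datacenters = {
    "us-east": [("a", 1), ("b", 2), ("a", 3)],
    "eu-west": [("a", 4), ("c", 5), ("c", 6), ("b", 1)],
}

def naive_query(sites):
    """Copy every raw record to the coordinator, then aggregate there."""
    shipped = [rec for records in sites.values() for rec in records]
    totals = Counter()
    for key, value in shipped:
        totals[key] += value
    return dict(totals), len(shipped)

def pushed_down_query(sites):
    """Aggregate inside each datacenter and ship only the partial sums."""
    totals = Counter()
    shipped = 0
    for records in sites.values():
        partial = Counter()
        for key, value in records:
            partial[key] += value      # local work, no cross-site traffic
        shipped += len(partial)        # one record per distinct local key
        totals.update(partial)         # consolidate late, at the coordinator
    return dict(totals), shipped

naive_result, naive_cost = naive_query(datacenters)
pushed_result, pushed_cost = pushed_down_query(datacenters)
print(naive_result == pushed_result, naive_cost, pushed_cost)
```

Both strategies return the same per-key sums, but the pushed-down variant ships fewer records because aggregation shrinks the data before it crosses datacenter boundaries. The saving depends on how strongly the query reduces the data; a query that does not reduce it would gain nothing from this approach.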

Muriel Médard, Massachusetts Institute of Technology – August 2014

Security Considerations in Network Coded Distributed Storage Systems

Random Linear Network Coding (RLNC) can improve the performance of distributed storage systems by reducing download delay and increasing data availability, which has motivated its use in such systems. Security issues in these systems have so far been studied in terms of resilience to Byzantine attacks or to cryptanalysis by a passive eavesdropper. We propose to consider the security of distributed storage systems using RLNC under active probing attacks. We use recent results on guesswork, which characterizes the number of queries an attacker, sometimes termed an inquisitor, will complete before ascertaining a secret quantity such as a password or private data. In particular, we seek to examine the effect of data non-uniformity on guesswork when RLNC is applied, and the behavior of guesswork in multi-cloud systems protected by keys.
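To make the notion of guesswork concrete, the sketch below computes the expected number of guesses for a single secret. It assumes an attacker who knows the probability distribution of the secret and queries candidates in decreasing order of probability. The function name `expected_guesswork` and the two example distributions are illustrative only. The sketch does not model RLNC or multi-cloud key protection; it only shows why non-uniform data is easier to guess.

```python
def expected_guesswork(probabilities):
    """Expected number of guesses when candidates are tried from the most
    likely to the least likely (assumed attacker strategy)."""
    ordered = sorted(probabilities, reverse=True)
    # The i-th most likely candidate is found on guess number i.
    return sum(rank * p for rank, p in enumerate(ordered, start=1))

# Illustrative distributions over four possible secrets.
uniform = [0.25, 0.25, 0.25, 0.25]
skewed = [0.7, 0.1, 0.1, 0.1]

uniform_guesses = expected_guesswork(uniform)
skewed_guesses = expected_guesswork(skewed)
print(uniform_guesses, skewed_guesses)
```

With four equally likely secrets the attacker needs 2.5 guesses on average, while the skewed distribution needs only about 1.6, so non-uniformity in the data lowers the attacker's expected effort.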