Cloud computing

Jean-Michel Prima - 18/06/2012

BlobSeer: A Storage System For The Exascale Era

Gabriel Antoniu, leader of KerData: “There are two uses for BlobSeer: one at the application data level, as we demonstrate on Microsoft's Azure cloud, and another at the system level, as we illustrate with Nimbus.”

The ever-increasing volume of data threatens to impair the efficiency of cloud services as well as science-oriented High Performance Computing (HPC). BlobSeer is an innovative storage system designed to improve massively parallel data access through a versioning mechanism for concurrent manipulation of binary large objects (BLOBs). This approach has attracted keen interest from Microsoft, IBM and SAP, as Rennes-based Inria researcher Gabriel Antoniu explains.

The exabyte? One quintillion (10^18) bytes, short scale. A 1000-fold increase over the previous landmark of 10^15 bytes, the petabyte. A quantum leap that current database technology simply can't cope with. A pressing problem, as data factories keep churning out fast-growing volumes. The syndrome even has a name: Big Data. Alongside this inflation comes another phenomenon: science is going data-centric. Genomics, oceanography, astronomy and the like are mining ever more information from huge data warehouses. So much so that, according to the late Jim Gray of Microsoft, “it is worth distinguishing data-intensive science (...) as a new, fourth paradigm for scientific exploration” following the computational, theoretical and empirical eras.
“Exaflop computers will be available for HPC by 2020, then clouds will follow the trend,” predicts Gabriel Antoniu, head of the KerData research team. “But faced with heretofore unknown scales, current data-management tools are showing their limitations. Today's architectures rely upon centralized storage of metadata. That is precisely what bottlenecks their performance. Hundreds of thousands of files are being simultaneously created and updated. This resource-consuming process proves tremendously expensive. It hinders the overall efficiency of the hardware. BlobSeer is a data storage system that contributes to solving this data-management problem. It hinges upon a versioning method for concurrent manipulation of BLOBs that helps to sustain a high throughput despite massively parallel data access.”
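The versioning idea can be pictured with a minimal Python sketch. It is an illustration under stated assumptions, not BlobSeer's actual implementation or API: the class VersionedBlob, its methods and the chunk size are all hypothetical. Each write publishes a new immutable snapshot, so a reader holding an older version number is never disturbed by concurrent writers.

```python
import threading


class VersionedBlob:
    """Toy in-memory BLOB where every write publishes a new version.

    Hypothetical illustration only; not BlobSeer's API or algorithm.
    """

    def __init__(self, chunk_size=4):
        self.chunk_size = chunk_size
        self._versions = [{}]          # version 0 is the empty BLOB
        self._lock = threading.Lock()  # guards only the short publish step

    def write(self, offset, data):
        """Write chunk-aligned data at offset and return the new version."""
        if offset % self.chunk_size or len(data) % self.chunk_size:
            raise ValueError("offset and length must be chunk-aligned")
        new_chunks = {
            (offset + i) // self.chunk_size: data[i:i + self.chunk_size]
            for i in range(0, len(data), self.chunk_size)
        }
        with self._lock:
            # Copy the chunk map (references only); old versions stay intact.
            snapshot = dict(self._versions[-1])
            snapshot.update(new_chunks)
            self._versions.append(snapshot)
            return len(self._versions) - 1

    def read(self, version, offset, size):
        """Read size bytes at offset from the given version (holes read as zeros)."""
        snapshot = self._versions[version]
        first = offset // self.chunk_size
        last = (offset + size - 1) // self.chunk_size
        zero = bytes(self.chunk_size)
        raw = b"".join(snapshot.get(i, zero) for i in range(first, last + 1))
        start = offset - first * self.chunk_size
        return raw[start:start + size]


blob = VersionedBlob(chunk_size=4)
v1 = blob.write(0, b"aaaabbbb")
v2 = blob.write(4, b"cccc")
print(blob.read(v1, 0, 8))  # version 1 is unchanged by the later write
print(blob.read(v2, 0, 8))  # version 2 sees the overwritten second chunk
```

In this sketch a write copies only the map of chunk references, not the data itself, and readers take no lock at all. That is the intuition behind sustaining throughput under concurrent access; a real system would also have to distribute both the chunks and the metadata.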

Storage Backend For MapReduce

The method offers a storage backend for MapReduce applications. Popularized by Google to support distributed computing on clusters, MapReduce acts as a two-stage filter. “It first extracts the data of interest (Map) and then aggregates the results (Reduce). But in so doing, it generates a lot of parallel data accesses. These accesses are what BlobSeer can ease.”
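For readers unfamiliar with the model, the two stages can be sketched in a few lines of single-process Python. This is a generic word-count illustration, not Hadoop or BlobSeer code, and the function names (map_phase, shuffle, reduce_phase, word_count) are chosen here purely for the example. In a real deployment each phase runs in parallel on many machines, which is where the concurrent storage accesses come from.

```python
from collections import defaultdict


def map_phase(records, mapper):
    """Apply the mapper to every record and yield its (key, value) pairs."""
    for record in records:
        yield from mapper(record)


def shuffle(pairs):
    """Group all values by key, as the framework does between Map and Reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, reducer):
    """Aggregate the list of values of each key into a single result."""
    return {key: reducer(key, values) for key, values in groups.items()}


def word_mapper(line):
    """Map: extract the data of interest, here one (word, 1) pair per word."""
    for word in line.lower().split():
        yield word, 1


def count_reducer(_key, values):
    """Reduce: aggregate the partial results, here by summing the counts."""
    return sum(values)


def word_count(lines):
    pairs = map_phase(lines, word_mapper)
    return reduce_phase(shuffle(pairs), count_reducer)


print(word_count(["big data", "Big clouds"]))
# {'big': 2, 'data': 1, 'clouds': 1}
```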
Among the several solutions that implement the MapReduce programming model on clouds, the most famous is Hadoop. Developed by the Apache Software Foundation, this open-source framework is used by web giants such as Yahoo, Facebook and Amazon. “When using BlobSeer instead of its own file system (HDFS), the software delivers significantly better performance. In the future, cloud services might be interested in using Hadoop with our alternative component.” And IBM might be one of them. “We have partnered through an ANR-sponsored research project in this field.” More recently, the German software company SAP has expressed similar interest in evaluating this approach.
Also on the researchers' front burner is TomusBlobs, a BlobSeer version for Azure, Microsoft's cloud infrastructure. “That's currently the project on the fastest track within the company's Cloud Research Initiative. Results have been presented to Henrique Malvar, Chief Scientist at Microsoft Research, and to Tony Hey, Corporate Vice President of Microsoft Research Connections, during their recent visits to our joint center near Paris. Benchmarks show significantly higher throughput: 3 times for writes and 2.5 times for reads.”

Also At System Level

But BlobSeer isn't solely a solution for efficiently storing and managing application data. “It can also be of use at system level for deploying and storing images of virtual machines on a cloud. You might have to deal with hundreds or thousands of VM instances simultaneously. But since deployment relies on a centralized architecture, this preliminary step becomes time-consuming.”
Once the VMs are up and running, a crash contingency plan is needed: one must be able to restart from a consistent state. Hence the need for snapshotting. “Again, this implies massively concurrent access to data. Instead of relying on centralized storage of checkpoints, BlobSeer can provide a distributed repository. The cloud user will benefit from fast concurrent reads during deployment and fast concurrent writes during backups. This VM-oriented function of BlobSeer is of particular interest to cloud service providers. In partnership with Argonne National Laboratory, we illustrate this capability through Nimbus,” an open-source toolkit dedicated to providing Infrastructure-as-a-Service (IaaS) to the scientific community.
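One way to picture why a distributed repository relieves the bottleneck is chunk striping. The sketch below is an assumption made for illustration, not a description of how BlobSeer or Nimbus actually place data: the helpers stripe and reassemble are hypothetical, and each storage node is modelled as a plain dictionary. A VM image or checkpoint is cut into fixed-size chunks that are spread round-robin over several nodes, so concurrent readers and writers hit different nodes instead of a single central server.

```python
def stripe(data, chunk_size, n_nodes):
    """Split data into chunks and assign them round-robin to n_nodes stores.

    Each store is a dict mapping chunk index -> chunk bytes (hypothetical model).
    """
    nodes = [dict() for _ in range(n_nodes)]
    for index, start in enumerate(range(0, len(data), chunk_size)):
        nodes[index % n_nodes][index] = data[start:start + chunk_size]
    return nodes


def reassemble(nodes):
    """Collect the chunks from every node and join them in index order."""
    chunks = {}
    for node in nodes:
        chunks.update(node)
    return b"".join(chunks[i] for i in sorted(chunks))


image = bytes(range(10))              # stand-in for a VM image or checkpoint
nodes = stripe(image, chunk_size=3, n_nodes=2)
print([sorted(node) for node in nodes])  # [[0, 2], [1, 3]]
print(reassemble(nodes) == image)        # True
```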
Meanwhile, the researchers have started to explore yet another untilled field. “Pyramid is a data access management method that borrows abundantly from BlobSeer but deals more specifically with array-oriented storage. Instead of BLOBs, we will focus on multidimensional data. In the HPC context, many scientific applications rely on such parallel array processing. But there again, concurrent access has reached its limitations.”

Keywords: BlobSeer Storage system Inria Rennes KerData Gabriel Antoniu Cloud computing
