KERDATA Research team

Scalable Storage for Clouds and Beyond

Team presentation

Our research activities address distributed data management at challenging scales, with a particular focus on clouds and petascale architectures. We target data-oriented high-performance applications that need to handle massive unstructured data in the form of BLOBs (binary large objects, on the order of terabytes), stored across a large number of nodes (thousands to tens of thousands) and accessed under heavy concurrency by a large number of clients (thousands to tens of thousands at a time) with a relatively fine access grain (on the order of megabytes). Examples of such applications are:
  • Cloud data-mining applications (e.g., based on the MapReduce paradigm) handling massive data distributed at a large scale.
  • Advanced (e.g., concurrency-optimized, versioning-oriented) cloud services both for user-level data storage and for virtual machine image storage and management at IaaS level.
  • Distributed storage for post-Petascale computing applications.
  • Storage for desktop grid applications with high write throughput requirements.

Research themes

  • Multiversion BLOB management

    We work on the design, implementation and experimental validation of a generic data-sharing platform called BlobSeer, intended to serve as a basis for addressing the challenges mentioned above: huge data and highly concurrent fine-grain access, combined with support for versioning and decentralized metadata management.

  • Cloud data management

    On Infrastructure-as-a-Service (IaaS) clouds, computing resources are used on an as-needed basis: instead of buying and managing hardware, users rent virtual machines and storage space. One important issue is thus the support for storing and processing data on externalized, virtual storage resources. Meeting these needs requires investigating performance, scalability, security and quality of service together. The impact of physical resource sharing also needs careful consideration. We are exploring how a file-system approach can support scalable data management for data mining over massive data sets using the MapReduce paradigm.

  • Data management for Post-Petascale systems

    In parallel with the emergence of cloud infrastructures, considerable efforts are now under way to build Petascale computing systems, such as Blue Waters. Such systems aim to provide sustained Petaflop performance to a much wider spectrum of science and engineering applications. On such infrastructures, data management is again a critical issue with a high impact on application performance. These supercomputers exhibit specific architectural features (e.g., a multi-level memory hierarchy scalable to tens to hundreds of thousands of cores) that are designed to support a high degree of parallelism. To keep up with such advances, the storage service has to scale accordingly, which is clearly challenging. We focus on numerical applications for post-Petascale architectures: the goal is to evaluate the benefits of using the BlobSeer approach for concurrency-optimized I/O.
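
As an illustration of the multiversion BLOB management theme above, the following is a minimal sketch of versioned, chunk-level access to a BLOB. It is not BlobSeer's actual API or implementation: the class VersionedBlob, its write and read methods, and the tiny CHUNK_SIZE are all hypothetical names and values chosen for illustration, and everything is kept in memory on a single machine. The idea it shows is only that each write can publish a new version while older versions stay readable and unchanged chunks are shared rather than copied.

```python
from typing import Dict, List

CHUNK_SIZE = 4  # bytes per chunk; hypothetical and tiny, for illustration only


class VersionedBlob:
    """Toy in-memory BLOB: every write publishes a new immutable version."""

    def __init__(self) -> None:
        # versions[v] maps chunk index -> chunk bytes; version 0 is the empty BLOB.
        self.versions: List[Dict[int, bytes]] = [{}]

    def write(self, offset: int, data: bytes) -> int:
        """Write chunk-aligned data at offset and return the new version number."""
        if offset % CHUNK_SIZE or len(data) % CHUNK_SIZE:
            raise ValueError("offset and length must be chunk-aligned")
        # Shallow copy: chunks that are not overwritten are shared with the
        # previous version instead of being duplicated.
        snapshot = dict(self.versions[-1])
        for i in range(0, len(data), CHUNK_SIZE):
            snapshot[(offset + i) // CHUNK_SIZE] = data[i:i + CHUNK_SIZE]
        self.versions.append(snapshot)
        return len(self.versions) - 1

    def read(self, version: int, offset: int, size: int) -> bytes:
        """Read size bytes at offset from the given version; holes read as zeros."""
        snapshot = self.versions[version]
        first = offset // CHUNK_SIZE
        last = (offset + size - 1) // CHUNK_SIZE
        out = bytearray()
        for idx in range(first, last + 1):
            out += snapshot.get(idx, bytes(CHUNK_SIZE))
        start = offset - first * CHUNK_SIZE
        return bytes(out[start:start + size])


blob = VersionedBlob()
v1 = blob.write(0, b"aaaabbbb")  # first version holds two chunks
v2 = blob.write(4, b"cccc")      # second version overwrites only the second chunk
old = blob.read(v1, 0, 8)        # the earlier version is still readable
new = blob.read(v2, 0, 8)
```

In this sketch a reader that holds a version number never sees a later write, which is one simple way to let reads and writes proceed without blocking each other. A real distributed platform would additionally have to spread chunks and metadata over many nodes, which this example does not attempt.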

International and industrial relations

  • MapReduce: an ANR project on MapReduce-based cloud data management with international and industrial partners: Argonne National Lab (USA), the University of Illinois at Urbana-Champaign (UIUC, USA) and IBM.
  • FP3C: an ANR-JST project on programming post-Petascale infrastructures, gathering the major French and Japanese academic actors in this area. Strong collaboration with Tsukuba University, Japan.
  • NCSA/UIUC: active collaboration with the Joint Laboratory for Petascale Computing (JLPC, Urbana-Champaign) on concurrency-optimized I/O for post-Petascale infrastructures.
  • SCALUS: a Marie Curie Initial Training Network (FP7).
  • DataCloud@work: Associate Team with the "Politehnica" University of Bucharest, Romania.

Keywords: Data management, Cloud, Post-Petascale, HPC, Large-Scale, BLOB, Distributed File System, BlobSeer, Map-Reduce, Programming Model, Fault-Tolerant, Middleware