TADaaM research team

Topology-aware system-scale data management for high-performance computing

Team presentation

The goal of the TADaaM project is to design and build a stateful, system-wide service layer for HPC systems. This layer has two roles. First, it will abstract low-level features of the system (e.g. topology, network, resource usage) and of the software stack (e.g. threads, data, runtime system). Second, it will let applications register their needs and behaviors through a carefully designed API. With these two sets of information, the layer will optimize the execution of all running applications in a coordinated fashion and at system scale.

Research themes

TADaaM aims to tackle the problem of efficiently executing an application, at system scale, on an HPC machine. We assume that the application is already optimized (relevant data layout, use of effective libraries and of state-of-the-art compilation techniques, etc.). Nevertheless, even a statically optimized application will not scale unless it takes the following dynamic constraints into account: the machine topology, the set of allocated resources, data movement and contention, the presence of other running applications, storage access, etc. The proposed layer will give legacy applications as well as new ones a simple yet efficient way to address these issues. Applications will express their needs in terms of resource usage, locality and topology to the layer using high-level semantics; combined with an adequate set of optimizations, this will improve the data management of the applications.

To solve these problems we will work in three directions. First, we need to model and abstract the parallel platforms and the applications, since this project deals with the interactions between the machines and the running applications. Second, we need to provide services to the applications based on efficient tools and scalable algorithms. Last, we need to expose these services to the applications through well-defined abstractions and APIs in a scalable way.
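As a purely illustrative sketch of how the three directions fit together (a platform model, a placement service, and an API through which an application declares its needs), consider the toy example below. Every name in it (Topology, ServiceLayer, register, place, cost) is hypothetical and does not come from any TADaaM software. The two-level machine model and the greedy heuristic are simplifying assumptions chosen for brevity, not the team's algorithms.

```python
from dataclasses import dataclass, field


@dataclass
class Topology:
    """Hypothetical two-level machine model: nodes, each with a fixed number of cores."""
    nodes: int
    cores_per_node: int

    def node_of(self, core):
        return core // self.cores_per_node

    def distance(self, a, b):
        # Assumed cost levels: 0 = same core, 1 = same node, 2 = different nodes.
        if a == b:
            return 0
        return 1 if self.node_of(a) == self.node_of(b) else 2


@dataclass
class ServiceLayer:
    """Hypothetical layer that stores application needs and computes placements."""
    topology: Topology
    # app name -> {(task_i, task_j): communication volume}
    needs: dict = field(default_factory=dict)

    def register(self, app, traffic):
        """The application declares how much data each pair of tasks exchanges."""
        self.needs[app] = dict(traffic)

    def place(self, app, free_cores):
        """Greedy placement: visit the heaviest pairs first and put each unplaced
        task on the free core closest to its already-placed partner.
        Assumes there are at least as many free cores as tasks."""
        mapping = {}
        free = list(free_cores)
        pairs = sorted(self.needs[app].items(), key=lambda kv: -kv[1])
        for (a, b), _volume in pairs:
            for task, partner in ((a, b), (b, a)):
                if task in mapping:
                    continue
                if partner in mapping:
                    target = mapping[partner]
                    core = min(free, key=lambda c: self.topology.distance(c, target))
                else:
                    core = free[0]
                mapping[task] = core
                free.remove(core)
        return mapping

    def cost(self, app, mapping):
        """Total traffic weighted by the topological distance it has to travel."""
        return sum(
            volume * self.topology.distance(mapping[a], mapping[b])
            for (a, b), volume in self.needs[app].items()
        )


topo = Topology(nodes=2, cores_per_node=2)
layer = ServiceLayer(topo)
layer.register("solver", {(0, 1): 1, (0, 2): 10, (1, 3): 10, (2, 3): 1})
mapping = layer.place("solver", free_cores=range(4))
identity = {task: task for task in range(4)}
print(mapping, layer.cost("solver", mapping), layer.cost("solver", identity))
```

In this toy case the heavily communicating task pairs end up on the same node, so the weighted communication cost is lower than with the naive task-to-core identity mapping. A real system-scale service would also have to account for contention, storage access, and other running applications.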

International and industrial relations

Industry:

  • CEA
  • EDF R&D
  • ATOS/Bull
  • Intel
  • Mellanox
  • IBM
  • AMD

Academia:

  • Argonne National Lab, USA
  • University of Tennessee, Knoxville, USA
  • INESC-ID, Lisbon, Portugal
  • University of Tokyo and RIKEN, Japan
  • Barcelona Supercomputing Center, Spain
  • Sandia National Lab, USA

Keywords: High-performance computing, System-scale optimization, Mesh-based applications, Performance modeling, Topology, Affinity, Resource management.