DATAMOVE Research team
Data Aware Large Scale Computing
- Leader : Bruno Raffin
- Type : team
- Research center(s) : Grenoble
- Field : Networks, Systems and Services, Distributed Computing
- Theme : Distributed and High Performance Computing
- Inria teams are typically groups of researchers working on the definition of a common project, and objectives, with the goal to arrive at the creation of a project-team. Such project-teams may include other partners (universities or research institutions)
Today's largest supercomputers (Top500 ranking) are composed of hundreds of thousands of cores, with performances reaching the PetaFlops. Moving data on such large supercomputers is becoming a major performance bottleneck, and the situation is expected to worsen even more at exascale and beyond. The data transfer capabilities are growing at a slower rate than processing power ones. The profusion of flops available will be difficult to use efficiently due to constrained communication capabilities.
The memory hierarchy and storage architecture are expected to deeply change with the emergence of new technologies like non volatil memories (NVRAM), requiring
new approaches to data management. Data movements are also an important source of power consumption, and thus a relevant target for energy savings.
The DataMove team addresses these challenges, performing research to optimize data movements for large scale computing. DataMove targets four main research axis:
- Integration of High Performance Computing and Data Analytics
- Data Aware Batch Scheduling
- Empirical Studies of Large Scale Plateforms
- Forecasting Resource Availability
The batch scheduler is in charge of allocating resources upon user requests for application executions (when and where to execute a parallel job). The growing cost of data movements requires adapted scheduling policies able to take into account the influence of intra-application communications, IOs as well as contention caused by data traffic generated by other concurrent applications. Modelling the application behavior to anticipate its actual resource usage on such architecture is challenging but critical to improve performance (time, energy). The scheduler also needs to handle new workloads. High performance platforms now need to execute more and more often data intensive processing tasks like data analysis in addition to traditional computation intensive numerical simulations. In particular, the ever growing amount of data generated by numerical simulation call for a tighter integration between the simulation and the data analysis. The goal is to reduce the data traffic and to speed-up result analysis by performing result processing (compression, indexation, analysis, visualization, etc.) as closely as possible to the locus and time of data generation. This approach, called in-situ analytics, requires to revisit the traditional workflow (batch processing followed by postmortem analysis). The application becomes a whole including the simulation, in-situ processing and I/Os. This motivates the development of adapted resource allocation strategies, data structures and parallel analytics schemes to efficiently interleave the execution of the different components of the application and globally improve the performance.
To tackle these issues, we intertwin theoretical research and practical developments in an agile mode, to elaborate solutions generic and effective enough to be of practical interest. Algorithms with performance guarantees are designed and experimented on large scale platforms with realistic usage scenarios developed with partner scientists or based on logs of the biggest available computing platforms. Conversely, our strong experimental expertise enables to feed theoretical models with sound hypotheses, to twist proven algorithms with practical heuristics that could be further retro-feeded into adequate theoretical models.
Integration of High Performance Computing and Data Analytics
The base idea behing in-situ processing is to perform data analytics as closely as possible to the running application, while ensuring a minimal impact on the simulation performance and the required modifications of its code. The analytics and the simulation thus needs to share resources (compute units, memory, network). Today solutions are mostly ad-hoc and defined by the programmer, relying on resource isolation (helper core or staging node), or time sharing (in-lined analytics or running asynchronously in a separate thread). A first topic we will address will focus on developing resource allocation strategies and algorithms the programmer can rely on to ensure an efficient collaboration between the simulation and the analytics to optimize resource usage. Parallel algorithms tailored for in-situ analytics also need to be investigated. In-situ processings inherit from the parallelization scale and data distribution adopted by the simulation, and must execute with minimal perturbations on the simulation execution (whose actua lresource usage is difficult to know a priori). This specific context calls for algorithms that rely on adaptive parallelization patterns and data structures. Cache oblivious or cache adaptive parallel data structures coupled with work stealing load balancing strategies are probably a sound basis. Also notice that the limited budgets of memory and data movements targeted by in-situ processing can be somehow compensated by an abundance of compute capabilities. As demonstrated by some early works, this balance can lead to develop in-situ efficient algorithms relying on original strategies that are not relevant in a classical context. In-situ creates a tighter loop between the scientist and her/his simulation. As such, an in-situ framework needs to be flexible to let the user define and deploy its own set of analysis. A manageable flexibility requires to favor simplicity and understandability, while still enabling an efficient use of parallel resources. We will further investigate these issues relying on the FlowVR framework that we designed with these goals in mind. Given the importance of users in this context, to validate our different developments we will tightly collaborate with scientists of some application domains, like molecular dynamics or fluid simulation, to design, develop, deploy and assess in-situ analytics scenarios. In-situ analytics is a specific workload, that needs to be scheduled very close to the simulation, but not necessarily active during the full extend of the simulation execution, and that may require to access data from previous runs. The scenarios we will develop will thus also be used to evaluate batch scheduling policies developed in the team.
Data Aware Batch Scheduling
The most common batch scheduling policy is First Come First Served (FCFS) with backfilling (BF). BF enables to fill idle spaces with smaller jobs while keeping the original order of FCFS. More advanced algorithms are seldom used on production platforms due to the gap between theoretical models and practical systems. In practice the job execution times depend on their allocation (due to communication interferences and heterogeneity in both computation and communication), while theoretical models of parallel jobs usually consider jobs as black boxes with a fixed execution time. Though interesting and powerful, the synchronous PRAM model, delay model, LogP model and their variants (such as hierarchical delay), are ill-suited to large scale parallelism on platforms where the cost of moving data is significant and non uniform. Recent studies are still refining these models to take into account communication contentions accurately while remaining tractable enough to provide a useful tool for algorithm design. When looking at theoretical scheduling problems, the generally accepted goal is to provide polynomial algorithms. However, with millions of processing cores where every process and data transfer have to be individually scheduled, polynomial algorithms are prohibitive as soon as the largest exponent is two. The model of parallel tasks simplifies this problem by bundling many threads and communications into single boxes, either rigid, rectangular or malleable. Yet these models are again ill-adapted to heterogeneous platforms, as the running time depends on more than simply the number of allotted resources, and some of the basic underlying assumptions on the speed-up functions (such as concavity) are not often valid in practice.
We aim at studying these problems, both theoretically and through simulations. We expect to improve on the existing models (on power for example) and design new approximation algorithms with different objectives such as stretch, reliability, throughput or energy consumption, while keeping in focus the need for a very low polynomial complexity. Realistic simulations are required to take into account the impact of allocations and assess the real behavior of algorithms.
Empirical Studies of Large Scale Plateforms
Experiments in realistic context is critical to bridge the gap between theoretical algorithms and practical solutions.
But to experiment resource allocation strategies on large scale platforms is extremely challenging due to their constrained availability and their complexity.
To circumvent this pitfall, we need to develop tools and methodologies for alternative empirical studies, from trace analysis, to simulation and emulation:
- Traces are a base tool to capture the behaviour of applications on production platforms. But, it is still a challenge
to instrument platforms, in particular to trace data movements at memory, network and I/O levels, with an acceptable performance impact, even if the granularity required for batch scheduling is rather coarse (seconds to hours time scale). Trace analysis is expected to give valuable insights to define models encompassing complex behaviours like network topology sensitivity, network congestion and resource interferences. Traces are also fundamental to calibrate simulations or emulations as well as for training algorithms relying on learning techniques.
- Emulation and simulation: Emulation techniques will also be considered to replay data movements and I/O transfers. The Grid'5000 infrastructure dedicated to experimentations plays a crucial role here, and we will pursue our participation in its developments. But experimenting on real systems is costly and time consuming. Moreover the number of platform configurations that can be evaluated is particularly limited. The obvious approach is to consider simulations. One challenge for simulators is to be able to simulate accurately resource behaviours upon different workloads directly generated from collected traces and coarse grain job models. We will in particular work on interfacing our production batch scheduler OAR with the SimGrid simulator developed by Polaris and Avalon.
Forecasting Resource Availability
It is well-known that optimization processes are strongly related to the (precise) knowledge of the problem parameters. However, the evolution of HPC architectures, applications and computing platforms leads to an increasing complexity. As a consequence, more and more data are produced (for monitoring CPU usage, I/O traffic, informations about energy consumption, etc.), by both the job management system (the characteristics of the jobs to be executed and those that have already been executed) and by analytics at the application level (parameters, results and temporary results). It is crucial to adapt the job management system to deal with the bad effects of uncertainties, which may be catastrophic in large scale heterogeneous HPC platforms.
More precisely, determining efficient allocation and scheduling strategies that can deal with complex systems and adapt to their evolutions is a strategic and difficult challenge. We propose to study new methods for a better prediction of the characteristics of the jobs and their execution to improve the optimization process. In particular, the methods studied in the field of big data (supervised Machine Learning, SVM, learning to rank techniques, etc.) could and must be used to improve job scheduling in the new HPC platforms. A preliminary study has been done with the target of predicting the job running times (SC'2015). We are interested in extending the panel of parameters (including for instance some indicators for energy consumption) and in developing new methods on stochastic optimization.
At the application level, it also appears as important to predict resource usage (compute units, memory, network), relying on the temporal and spatial coherency that most numerical simulations exhibit, to better schedule in-situ analytics, I/Os and more generally data mouvements.
International and industrial relations
Our transfer strategy is twofold: make most of our algorithms available to the community through open source software, and develop partnerships with private companies through direct contracts or collaborative projects funded by national or european agencies.
- Large companies. We have developed long term collaborations with large groups, like the HPC division of BULL/ATOS, or the R\&D scientific visualization team at the EDF power company. They already funded several PhDs. We develop with them state of the art solutions adapted to their specific problems. In particular we are collaborating with BULL/ATOS to contribute to the SLURM batch scheduler. SLURM is an open source job manager, widely deployed on HPC platforms and strongly supported by BULL/ATOS. We develop with BULL plugins to make our scheduling algorithms available to the broad community of SLURM users.
- Small and medium-sized enterprises (SMEs). Optimizing performance is a challenging problem for many SMEs that develop solutions for distributed computing. We will develop tight collaborations with some of them (TeamTo, ReactivIP, Stimergy, etc.), helping them to push forward their competitive edge.
Research teams of the same theme :
- ALPINES - Algorithms and parallel tools for integrated numerical simulations
- AVALON - Algorithms and Software Architectures for Distributed and HPC Platforms
- HIEPACS - High-End Parallel Algorithms for Challenging Numerical Simulations
- KERDATA - Scalable Storage for Clouds and Beyond
- POLARIS - Performance analysis and optimization of LARge Infrastructures and Systems
- ROMA - Optimisation des ressources : modèles, algorithmes et ordonnancement
- TADAAM - Topology-Aware System-Scale Data Management for High-Performance Computing