Simulating Large-Scale Distributed Systems
Martin Quinson, member of the MYRIADS team
Supported by the French research institute Inria, SimGrid is an open-source tool for the simulation of distributed systems. Over the last 15 years, it has become a staple in more than one scientific community across the globe, contributing to performance optimization in many contexts. The next challenge is to help SimGrid reach industry, an effort for which Inria is about to start a two-year Technological Action, as project coordinator Martin Quinson explains.
From P2P to HPC, from grids to clouds, distributed computing has become an essential cog in the machinery that underpins the digital age. Paradoxically, though, assessing the performance of applications running on such systems remains rather empirical, mainly because it is hard to conduct reproducible real-world experiments on widely distributed platforms. Resorting to simulation is therefore a much more practical solution, provided one is equipped with the proper toolkit.
Started in 1999, SimGrid is “a scientific instrument designed to study the behavior of large-scale distributed systems,” says Martin Quinson, one of the three researchers responsible for coordinating the project (1). “You could say it's like a microscope. Scientists use it for all kinds of work in the field. It underpins the experiments of some thirty papers each year. Although there are scores of highly specialized simulators that were built over short periods of time, mostly by PhD students, very few tools have actually been developed over the long haul and thus reached SimGrid's maturity. I would count fewer than three worldwide. One of the reasons behind this achievement is that we have benefited from the unrelenting support of Inria throughout those years, in terms of engineers, infrastructure, and so forth. In addition, two of our research projects were funded by the French National Research Agency (ANR) for a total of over 2M€.”
Fully Exploit the Resource at Hand
One benefit of simulating a distributed infrastructure is to ensure that the application fully exploits the resources at hand. “I once collaborated with drug designers. Their work consisted in coupling an active molecule to a transport molecule. Finding the right combination among billions of possibilities called for weeks of computing. At the time, their software had a flaw that meant it used only a fraction of the computing power, and they didn't even know it. They simply had no means of ascertaining whether they were making sensible use of the machine. This is a typical situation where SimGrid could produce useful performance projections to detect such resource misuse.”
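To make the idea of a performance projection concrete, here is a minimal sketch in plain Python. It does not use SimGrid or its models. It is a toy that ignores the network entirely, and the `utilization` function, the task durations, and the two scheduling policies are all assumptions made up for illustration. It compares a rigid task assignment with a greedy one and reports what fraction of the available computing time each actually uses.

```python
import heapq

def utilization(task_durations, n_workers, dynamic):
    """Return the fraction of worker time spent computing until the last task ends."""
    if dynamic:
        # Greedy policy: each task goes to the worker that becomes free first.
        free_at = [0.0] * n_workers
        heapq.heapify(free_at)
        for duration in task_durations:
            start = heapq.heappop(free_at)
            heapq.heappush(free_at, start + duration)
        makespan = max(free_at)
    else:
        # Static policy: task i is pinned to worker i % n_workers up front.
        load = [0.0] * n_workers
        for i, duration in enumerate(task_durations):
            load[i % n_workers] += duration
        makespan = max(load)
    return sum(task_durations) / (n_workers * makespan)

# Hypothetical workload: two long tasks among many short ones, on 4 workers.
tasks = [8.0, 1.0, 1.0, 1.0, 8.0, 1.0, 1.0, 1.0]
static_u = utilization(tasks, 4, dynamic=False)
dynamic_u = utilization(tasks, 4, dynamic=True)
print(f"static: {static_u:.0%}, dynamic: {dynamic_u:.0%}")
```

With this made-up workload, the static policy happens to pin both long tasks to the same worker, so most of the machine sits idle while that worker finishes. That is the kind of hidden waste described in the anecdote above.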
SimGrid is used primarily by scientists for testing their own applications. “And a lot of these, naturally, are... simulators, be they for the study of physics, meteorology, or whatnot. So, in practical terms, SimGrid ends up simulating the machine, the information system on top of which a simulator is running.” The instrument comprises both validated performance models of modern distributed platforms and a full formal verification framework. The former can be used for quantitative studies, e.g. to pinpoint performance and latency issues, while the latter can be used to systematically search for qualitative bugs.
The aim of Inria's new Technological Action is to go one step further downstream, bringing the simulator to industry. “For companies to adopt a tool born from academic research, they need confidence and the assurance of long-term development, which makes Inria's support all the more important. The Technological Action initiated by the institute will enable us to hire an experienced engineer for a period of two years.”
At the top of the assignment list is support for the community of users, and in particular for newcomers. “Actually, we already have a few users in the R&D departments of several companies.”
This software-industrialization effort is expected to culminate in the creation of a consortium meant to federate corporate organizations interested not just in adopting the tool but also in contributing to its maintenance. “If we could have 5 to 10 engineers from various companies devoting a portion of their time to SimGrid development, it would be a great first step toward an OpenStack-like consortium,” Quinson concludes.
Model-Checking Distributed Applications
“Remarkably enough, this project covers the whole spectrum, from fundamental research, including formal verification, to deployment in production,” Quinson remarks. Research-wise, the next hot topic concerns SimGrid's debugger. “The problem with distributed applications is that they could work flawlessly 1,000 times and suffer from a bug the 1,001st time.” Why? “Because they are dynamic systems by nature. They come with latency issues. The execution is never exactly the same twice. That's why we are contemplating the model-checking of distributed applications.” In other words: the exhaustive testing of all the execution paths. “We are the only ones in the world to study liveness in distributed applications. That's the future as far as academic research is concerned. But of course the industry will have to wait for that particular feature.”
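The idea of exhaustively testing every execution path can be illustrated with a toy example. The sketch below is plain Python and has nothing to do with SimGrid's actual model checker. The function names and the scenario (two processes each incrementing a shared counter in two non-atomic steps, a read followed by a write) are assumptions chosen for illustration. It also checks a simple safety property (the final counter value), not the liveness properties mentioned above, which are harder to verify.

```python
from itertools import permutations

def interleavings(n_procs, steps_per_proc):
    """Return every distinct order in which the processes' steps can be scheduled."""
    base = [pid for pid in range(n_procs) for _ in range(steps_per_proc)]
    return sorted(set(permutations(base)))

def run(schedule):
    """Execute one schedule of non-atomic increments and return the final counter."""
    counter = 0
    pending = {}  # value each process has read but not yet written back
    for pid in schedule:
        if pid not in pending:
            pending[pid] = counter            # step 1: read the shared counter
        else:
            counter = pending.pop(pid) + 1    # step 2: write back the incremented value
    return counter

def check(n_procs=2):
    """Explore all schedules and collect those that violate 'counter == n_procs'."""
    schedules = interleavings(n_procs, 2)
    failures = [s for s in schedules if run(s) != n_procs]
    return len(schedules), failures

total, failures = check()
print(f"{len(failures)} of {total} schedules lose an update")
for schedule in failures:
    print("  failing schedule:", schedule)
```

A test that runs the program once would most likely exercise a single schedule and could pass every time. Enumerating all schedules exposes the lost-update bug systematically. Real applications have far too many interleavings for this brute-force approach, which is why practical model checkers rely on techniques to prune the search space.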