Unifying programming on heterogeneous multi-core processors: the challenge of Caroline Collange

Changed on 19/06/2020

In order to meet ever greater demand in terms of speed and power, manufacturers are adopting the use of heterogeneous processors, which are now home to two types of circuits optimised for different tasks. There is just one problem: each of these uses a specific programming model that is not compatible with the other. Caroline Collange is a scientist at the Inria Rennes - Bretagne Atlantique centre who was awarded a Young Researcher grant by the French National Research Agency. Her aim is to introduce a single instruction set in order to unify programming, thus improving overall performance and energy efficiency.

Processors are required to carry out ever more tasks and are expected to do so at ever greater speeds. The latest models are referred to as “heterogeneous” and are equipped with two types of circuits: a multi-core scalar processor (CPU), which carries out operations sequentially, i.e. one after another, and a graphics processing unit (GPU). Initially designed to accelerate display performance, these graphics processors carry out massively parallel operations. Over time, they have also proved highly practical for general-purpose computing, an approach known as GPGPU. Depending on the type of task, processing is directed to one or the other of these two circuits. This is the current state of play.

The only snag here is that a GPU isn’t programmed in the same way as a scalar CPU. What this means is that manufacturers have to find some way of getting two different software environments to work alongside each other in order to manage two instruction set architectures (or ISAs). This makes it difficult to optimise the distribution of tasks and performance without programming turning into a nightmare.

Retaining one single programming model

How can this obstacle be overcome? “By retaining one single programming model, one single instruction set on a device, where you can choose to use cores geared towards either sequential performance or parallel performance,” explains Caroline Collange. “The first delivers low latency, while the second is capable of handling large volumes.” In this particular instance, the researcher is planning to retain “the CPU instruction set, only with the GPU execution model.”

In practice, programs can be written with any parallel programming environment for CPUs, such as OpenMP. This provides enhanced flexibility and compatibility.

Grouping together identical operations

One cornerstone of this new method is dynamic vectorisation for thread execution at a micro-architecture level. “Vectorisation involves grouping together threads performing similar types of operations, such as additions, for example. The advantage of this is that instructions are only read once. Everything relating to processing and control takes place only once. Factorising highly redundant operations in this way eliminates the cost of scalar management.” Vectorisation then takes place transparently, without the compiler having to intervene.

“The concept of grouping together identical operations carried out on different data is nothing new: Ada Lovelace first thought of it in 1842! It is a key principle of current GPUs. But our work will make it possible to apply inter-thread vectorisation to general-purpose instruction sets.”
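The grouping idea described above can be illustrated with a toy simulation. This is only a sketch under stated assumptions, not the DYVE micro-architecture: the three-instruction PROGRAM, the function names and the fetch counter are all hypothetical, and the threads never branch, so they stay in lockstep. It compares how many instruction fetches are needed when each thread runs on its own and when threads sharing the same program counter are grouped.

```python
from collections import defaultdict

# Hypothetical toy program: each entry is (opcode, operand).
PROGRAM = [("add", 1), ("mul", 2), ("add", 3)]


def apply(op, operand, value):
    """Execute one toy instruction on one thread's value."""
    return value + operand if op == "add" else value * operand


def run_scalar(inputs):
    """Every thread fetches and decodes every instruction by itself."""
    fetches = 0
    results = []
    for value in inputs:
        for op, operand in PROGRAM:
            fetches += 1  # one fetch per thread per instruction
            value = apply(op, operand, value)
        results.append(value)
    return results, fetches


def run_vectorised(inputs):
    """Threads at the same program counter share a single fetch."""
    values = list(inputs)
    pcs = [0] * len(values)  # one program counter per thread
    fetches = 0
    while any(pc < len(PROGRAM) for pc in pcs):
        # Group the still-running threads by program counter.
        groups = defaultdict(list)
        for tid, pc in enumerate(pcs):
            if pc < len(PROGRAM):
                groups[pc].append(tid)
        for pc, tids in groups.items():
            op, operand = PROGRAM[pc]
            fetches += 1  # one fetch serves the whole group
            for tid in tids:
                values[tid] = apply(op, operand, values[tid])
                pcs[tid] = pc + 1
    return values, fetches


if __name__ == "__main__":
    print(run_scalar(range(8)))      # 24 fetches for 8 threads
    print(run_vectorised(range(8)))  # 3 fetches, same results
```

With eight threads, both versions compute the same values, but the grouped version fetches each instruction once instead of eight times. Real hardware would also have to handle threads that diverge at branches, which this sketch leaves out.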

Threads will thus be able to migrate entirely flexibly between different types of core.

The innovation will also affect memory access. “Programs optimised for GPUs ensure that threads access consecutive data in the memory. But that’s not how it currently works on CPUs, where each thread works on its own. When threads are grouped together, you end up with locations which are physically far apart from each other, which is not at all efficient. As a result, you need to add an abstraction layer in order to bring data which are logically far apart from each other closer together physically in the memory.”
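One way to picture the memory problem described above is to compare two address mappings. The sketch below is an illustration only, not the abstraction layer the project intends to build: the function names, the chunk size of 1024 elements and the interleaving scheme are assumptions chosen for the example. In the "blocked" layout each thread owns a contiguous chunk, as is typical on CPUs; in the "interleaved" layout the same logical element is remapped so that threads executing the same step touch neighbouring addresses.

```python
def blocked_address(tid, i, chunk):
    """CPU-style layout: thread `tid` owns a contiguous chunk of `chunk` elements."""
    return tid * chunk + i


def interleaved_address(tid, i, num_threads):
    """Remapped layout: element `i` of every thread is stored side by side."""
    return i * num_threads + tid


def addresses_at_step(i, num_threads, chunk):
    """Addresses touched when all threads access their element `i` at once."""
    blocked = [blocked_address(t, i, chunk) for t in range(num_threads)]
    interleaved = [interleaved_address(t, i, num_threads) for t in range(num_threads)]
    return blocked, interleaved


if __name__ == "__main__":
    blocked, interleaved = addresses_at_step(0, num_threads=4, chunk=1024)
    print(blocked)      # [0, 1024, 2048, 3072]: far apart
    print(interleaved)  # [0, 1, 2, 3]: consecutive
```

When four grouped threads each read their first element, the blocked layout spreads the accesses 1024 elements apart, while the interleaved layout places them in consecutive locations, which is the access pattern that GPU-optimised programs aim for.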

A factor of 10

What sort of gains can we expect? “It will all depend on the context and on the application. You’re looking at somewhere between the performance level of a CPU and that of a GPU, but ideally closer to a GPU. There is currently a factor of 10 between the two. That is the order of magnitude we’ll be working with, only there will be no need to rewrite the code and no extra costs to pay.”

This is one of the strengths of this project.

Manufacturers currently spend a lot of money on programming for two separate models. Generalising the execution model will reduce these costs.

A phone manufacturer, for example, would also be able to get new products on the market quicker. To top it off, the phones themselves would benefit indirectly from longer battery life, since grouping threads together through dynamic vectorisation cuts down on the amount of energy used.

This research could lead to a patent being filed for the computer architecture. “We would also like to focus on compilation. This would enable us to offer specific types of optimisation for dynamic inter-thread vectorisation. Restructuring loops, for example, would enable more identical operations to take place between different threads. The problem lies in knowing when to apply such transformations. Sometimes it’s beneficial, but other times it’s not. We will need to develop static analysis in order to enable the compiler to make the right decision on a case-by-case basis.”

Tests carried out on simulators will make it possible to explore these avenues in greater detail. The project that has just started is set to run for a period of 42 months. “The French National Research Agency will fund two years of postdoctoral research and a PhD. We are also looking for a PhD student for the start of the 2020/2021 academic year.”

Caroline Collange is a member of Pacap, a team based in Rennes specialising in processors and compilers.
The research project is called DYVE, an acronym in English which stands for: dynamic vectorisation for heterogeneous multi-core processors with single instruction sets.