Industry of the Future

Rethinking Hardware Reliability for Trustworthy AI

Changed on 08/11/2023
Artificial Intelligence now often runs on embedded systems, where it is increasingly asked to make decisions in safety-critical applications. Hence the need to ensure that the computing device on which the software runs does not bias decision-making or compromise prediction accuracy when a transient hardware fault occurs. But the classical methods used to mitigate such faults are ill-suited to AI because of the sheer number of calculations involved, so a new approach is needed. Funded by the French Research Agency, the Re-Trusting project aims to perform failure analysis and develop fault models that will help usher in better-adapted strategies for tackling this scalability challenge.
Photo: PlugDB prototypes in open hardware. Inria / Photo C. Morel

 

Every now and then in the real life of processors, things happen that may result in transient errors. Cosmic radiation is a classic example: a particle coming from outer space can hit the electronic circuitry and occasionally provoke a so-called ‘bit flip’. In other words, a bit of binary code that is supposed to be a 0 suddenly, and very unduly, becomes a 1, potentially snowballing into a malfunction further down the line.
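
By way of illustration, here is a minimal Python sketch (an illustrative example, not part of the project's tooling) showing that a single bit flip in the IEEE 754 encoding of a number can be nearly harmless or catastrophic, depending on which bit is hit:

import struct

def flip_bit(value: float, bit: int) -> float:
    # Pack the float into its 32-bit IEEE 754 pattern, toggle one bit, unpack.
    (bits,) = struct.unpack("<I", struct.pack("<f", value))
    return struct.unpack("<f", struct.pack("<I", bits ^ (1 << bit)))[0]

weight = 0.75
print(flip_bit(weight, 0))    # ~0.7500001: a low-order mantissa bit barely matters
print(flip_bit(weight, 30))   # ~2.55e+38: a high exponent bit is catastrophic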

“With Artificial Intelligence (AI) now performing safety-critical tasks in embedded systems, such a hardware error could have dire consequences. For instance, in an autonomous car, the driving AI could conceivably mistake a pedestrian crossing the street for just a bird.”

Angeliki Kritikakou, Associate Professor (University of Rennes), TARAN team

To ensure hardware reliability despite bit flips and other transient errors, the classical method used in critical fields such as avionics or satellites relies on redundancy through triplication: a calculation is performed not just once but three times, so that if one output value happens to differ from the other two, it is simply voted out as an error. “Yet, in the case of AI, this method no longer applies due to the sheer number of calculations involved,” explains Angeliki Kritikakou. “It doesn’t scale. One cannot triplicate every single calculation. It is just too costly. In embedded systems, energy frugality is paramount. So we need a new approach. And that’s the purpose of Re-Trusting.”
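
For readers unfamiliar with triplication, also known as triple modular redundancy, the toy Python sketch below illustrates the principle under the usual assumption that at most one of the three copies is corrupted by a transient fault; it is not the project's implementation:

def tmr(compute, x):
    # Triple modular redundancy: run the same computation three times and
    # majority-vote the result. In real hardware the copies run on redundant
    # execution units, so a transient fault typically corrupts at most one.
    a, b, c = compute(x), compute(x), compute(x)
    if a == b or a == c:   # at least two copies agree with `a`
        return a
    if b == c:             # `a` was the corrupted copy
        return b
    raise RuntimeError("no majority: more than one copy was corrupted")

# Example: protecting a single multiply-accumulate operation.
result = tmr(lambda v: 3 * v + 1, 7)   # -> 22

Protecting even this one operation triples the work, which is exactly the overhead that becomes untenable once a neural network performs millions of such operations per prediction.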

Funded by the French Research Agency (ANR), the project is coordinated by the Institute of Nanotechnology of Lyon (INL, a joint research unit of CNRS, École Centrale de Lyon, INSA Lyon, Université Lyon 1 and CPE Lyon). In addition to Inria, it also includes the Sorbonne University computer science laboratory (LIP6, CNRS) and Thales.

Podcast (audio, in French): Rethinking Hardware Reliability for Trustworthy AI, with Angeliki Kritikakou.

The Chip Industry Has Started Rolling Out a Range of Custom Hardware Accelerators

Re-Trusting comes in a context in which a growing number of AI applications that used to run on on-ground cloud servers are now migrating to edge devices, including mobile phones and the Internet of Things, in order to reduce communication latency, among other reasons. Echoing this trend, the chip industry has started rolling out a range of custom hardware accelerators meant to support the computational needs of these resource-hungry embedded deep learning algorithms.

The project focuses precisely on such accelerators. “We have two of them,” says Inria scientist Marcello Traiola. “One is provided by LIP6, the other by Thales. Our case study also comprises two different types of deep learning algorithms: a Deep Neural Network (DNN) on the one hand and a Spiking Neural Network (SNN) on the other. In essence, they do the same thing, but in a slightly different way.”
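
To give a rough idea of what “the same thing in a slightly different way” means, here is a toy Python comparison, purely illustrative and unrelated to the partners' actual models: a conventional artificial neuron turns a weighted sum of real-valued inputs into a real-valued output, whereas a spiking neuron integrates weighted input spikes over time and fires binary spikes whenever its potential crosses a threshold.

import numpy as np

def dnn_neuron(x, w, b):
    # Conventional artificial neuron: weighted sum followed by ReLU.
    return max(0.0, float(np.dot(w, x) + b))

def snn_neuron(spike_trains, w, threshold=1.0, leak=0.9):
    # Toy leaky integrate-and-fire neuron: accumulates weighted input spikes
    # over time and fires (1) when its potential crosses the threshold,
    # which resets the potential.
    potential, out_spikes = 0.0, []
    for spikes_t in spike_trains:          # one binary input vector per time step
        potential = leak * potential + float(np.dot(w, spikes_t))
        if potential >= threshold:
            out_spikes.append(1)
            potential = 0.0
        else:
            out_spikes.append(0)
    return out_spikes

w = np.array([0.4, 0.6])
print(dnn_neuron(np.array([1.0, 0.5]), w, b=0.1))   # single real-valued output (~0.8)
print(snn_neuron([[1, 0], [0, 1], [1, 1]], w))      # spike-train output: [0, 0, 1]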

“Our primary goal is to come up with a methodology as generic as possible for addressing the problem of how to best assess the reliability of the hardware+AI system.”

Marcello Traiola, Inria research associate, TARAN team

A key concept here is fault severity. In the context of huge neural networks, not every hardware fault is malignant; many actually prove benign. What the researchers have in mind is a model-based analysis of the hardware+software system that could grade fault severity. “Then the usual fault-tolerance mechanisms, such as triplication, could be applied selectively, where they are truly needed, thus keeping the extra cost minimal.”
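
As a purely illustrative sketch of that idea (not the project's methodology), one could grade each parameter of a tiny model by the fraction of bit flips in it that visibly corrupt the output, and reserve protection for the parameters that turn out to be critical:

import struct
import numpy as np

def flip_bit(value, bit):
    # Inject a single-bit fault into the IEEE 754 float32 encoding of `value`.
    (bits,) = struct.unpack("<I", struct.pack("<f", float(value)))
    return struct.unpack("<f", struct.pack("<I", bits ^ (1 << bit)))[0]

def criticality(weights, x, tolerance=0.05):
    # For each weight, flip every one of its 32 bits in turn and count how
    # often the output deviates from the fault-free result by more than
    # `tolerance` relative error.
    golden = float(np.dot(weights, x))
    fractions = []
    for i in range(len(weights)):
        critical = 0
        for bit in range(32):
            faulty = weights.copy()
            faulty[i] = flip_bit(faulty[i], bit)
            if abs(float(np.dot(faulty, x)) - golden) > tolerance * abs(golden):
                critical += 1
        fractions.append(critical / 32)
    return fractions

w = np.array([0.8, 0.001, -0.5, 0.002], dtype=np.float32)
x = np.array([1.0, 1.0, 1.0, 1.0], dtype=np.float32)
# The large-magnitude weights have many critical bits, while most flips in
# the small ones are benign; only the former would then be triplicated.
print(criticality(w, x))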

Podcast (audio, in French): Specialized hardware accelerators, with Marcello Traiola.

Analysis Is Challenging Due to the Huge Exploration Space

However, such an analysis is a challenge in and of itself. “We are faced with a huge exploration space that would be hard to go through in its totality. But as huge as it might be, the system comprises different parts that are not all equally important. So we must find smart and systematic approaches capable of identifying the most important areas of this space, and then explore those areas locally as much as possible.” In practical terms, within the scope of this project, fault-injection experiments are performed through software simulation.
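
A minimal example of such a software fault-injection campaign, again an illustrative sketch rather than the project's actual framework, samples random (weight, bit) fault sites of a toy linear classifier instead of enumerating them all, and estimates how often a single bit flip silently changes the predicted class:

import random
import struct
import numpy as np

def flip_bit(value, bit):
    # Single-bit fault in the IEEE 754 float32 encoding of `value`.
    (bits,) = struct.unpack("<I", struct.pack("<f", float(value)))
    return struct.unpack("<f", struct.pack("<I", bits ^ (1 << bit)))[0]

rng = np.random.default_rng(0)
W = rng.normal(size=(10, 256)).astype(np.float32)   # toy 256-input, 10-class classifier
x = rng.normal(size=256).astype(np.float32)
golden_class = int(np.argmax(W @ x))                # fault-free prediction

fault_space = W.size * 32                           # every possible (weight, bit) site
trials, corrupted = 2000, 0                         # sample only a small fraction of it
for _ in range(trials):
    r, c, bit = random.randrange(10), random.randrange(256), random.randrange(32)
    W_faulty = W.copy()
    W_faulty[r, c] = flip_bit(W_faulty[r, c], bit)
    if int(np.argmax(W_faulty @ x)) != golden_class:   # silent data corruption
        corrupted += 1

print(f"{fault_space} possible fault sites, {trials} sampled")
print(f"estimated critical-fault rate: {corrupted / trials:.1%}")

In a full-size network running on a real accelerator the space is vastly larger, hence the need, as explained above, for smart and systematic strategies that concentrate the exploration on the most important areas.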

“Getting a full understanding of fault propagation actually calls for a cross-layer analysis,” Kritikakou points out. “It’s a bottom-up approach. LIP6 will focus on the lowest level: transistors, logic gates and so on. Once all the components are characterized, Inria will pick up from there and study the impact of faults at the algorithmic level. In particular, we need to come up with a whole set of metrics that simply do not exist at the present time. Leveraging these research findings, INL will then work on devising a series of fault-tolerance mechanisms capable of protecting the hardware+AI system.”

Ultimately, these fault-tolerance techniques will be integrated into both accelerators by Thales, the partner in charge of the benchmarking work package. By the end of the project, in the fall of 2025, the goal is to achieve 100% fault coverage with no more than a 10% increase in energy consumption and no more than 10% of the hardware resources allocated to protection tasks.