Mammals: Memory-Augmented Models for low-latency Machine Learning Services

Date: 13/01/2021
Mammals is investigating new approaches to perform machine learning model inference in a very short time frame and with limited resources.

What are your main research topics?

A machine learning (ML) model is often trained for inference purposes, i.e. to classify specific inputs (e.g. images) or to predict numerical values (e.g. the future position of a vehicle). The ubiquitous deployment of ML in time-critical applications and unpredictable environments poses fundamental challenges to ML inference. Big cloud providers, such as Amazon, Microsoft, and Google, offer their “machine learning as a service” solutions, but running the models in the cloud may fail to meet the tight delay constraints (≤10 ms) of future 5G services, e.g., for connected and autonomous cars, industrial robotics, mobile gaming, augmented and virtual reality. Such requirements can only be met by running ML inference at the edge of the network, directly on users’ devices or at nearby servers, without the computing and storage capabilities of the cloud. Privacy and data ownership also call for inference at the edge.

Mammals investigates new approaches to run inference under tight delay constraints and with limited resources. In particular, it aims to provide low-latency inferences by running—close to the end user—simple ML models that can also take advantage of a (small) local datastore of examples. The focus is on algorithms that learn online what to store locally to improve inference quality and adapt to the specific context.

How is this project exploratory?

The current approach to run inference at the edge is to take large ML models (often neural networks) and generate smaller ones through compression or distillation. Mammals explores a different direction: take advantage of data availability at the edge (where data is usually generated) to compensate for tighter computing constraints. In particular, Mammals aims to combine the decisions of a small ML model, e.g., a compressed neural network, with those of an instance-based algorithm that retrieves from a local datastore examples similar to the current input.

In some sense, we can say that the simple ML model provides the general rule, while the instance-based algorithm retrieves the relevant exceptions from the datastore.
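This "general rule plus exceptions" combination can be sketched in a few lines of Python. The sketch below is purely illustrative and not the project's actual method: a stand-in "small model" returns class probabilities, and when its confidence falls below a threshold the decision falls back on a k-nearest-neighbour vote over a toy local datastore. All names, the toy data, and the confidence-threshold rule are assumptions made for the example.

```python
import numpy as np

def knn_predict(query, keys, labels, k=3):
    """Instance-based component: majority vote over the k stored examples
    closest (in Euclidean distance) to the query."""
    dists = np.linalg.norm(keys - query, axis=1)
    nearest = np.argsort(dists)[:k]
    values, counts = np.unique(labels[nearest], return_counts=True)
    return int(values[np.argmax(counts)])

def combined_predict(query, small_model, keys, labels, threshold=0.8, k=3):
    """Trust the small model when it is confident (the 'general rule');
    otherwise retrieve the 'exceptions' from the local datastore."""
    probs = small_model(query)
    if probs.max() >= threshold:
        return int(np.argmax(probs))
    return knn_predict(query, keys, labels, k)

# Toy local datastore: 2-D feature vectors with binary labels (hypothetical data).
keys = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
labels = np.array([0, 0, 1, 1])

# A stand-in "small model" that happens to be maximally uncertain here.
def uncertain_model(q):
    return np.array([0.5, 0.5])
```

With this setup, a query near `[1, 1]` is resolved by the datastore vote rather than by the uncertain model. Updating `keys` and `labels` is also how the scheme would personalize or adapt to a shifting task distribution, without retraining the model itself.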

This approach appears very promising because:

  1. it takes advantage of data availability at the edge;
  2. it empowers simple local model personalization by updating the local datastore;
  3. it can rapidly adjust to a changing distribution of inference tasks;
  4. it lets the memory and computation requirements of instance-based algorithms be fine-tuned to the device's capabilities.

Is it rather a subject of basic or applied research?

Mammals starts from a real-world problem, but it focuses on developing general methodologies to solve it. Hopefully, it will also deepen our understanding of the relation between memorization (the local datastore memorizes previously observed patterns) and generalization (the capability to extract general inference rules), which is still lacking.

Who are your partners?

At the moment, Mammals is based on a number of pairwise collaborations both with academic partners (Università degli Studi di Torino, Politecnico di Torino, University of Massachusetts - Amherst, Northeastern University, Università degli Studi di Verona) and industrial ones (Nokia Bell Labs).

Giovanni Neglia, Head of the Mammals exploratory action (2021)
