A post-Hadoop middleware program for managing data flows

Date: changed on 02/01/2020
An open-source software program first launched in 2006 to manage big data, Hadoop quickly established itself as the tool of choice for the web giants. Perfectly suited to batches of data, it proved less suited, however, to processing flows of data, including those produced by driverless vehicles. Alternatives were developed, but all of them ran up against bottlenecks. Funded by the ANR (the French National Research Agency) and led by Inria researcher Shadi Ibrahim, the KerStream project aims to overcome the obstacles standing in the way of managing data flows.
Illustration: servers
© Inria / Photo C. Morel

There was a time when supercomputing infrastructures were divided into two sections: machines for storing data, and other machines to process it. In order to function, the data had to be moved between the two, which took up a significant amount of time. That all changed with the publication of two landmark articles by Google in 2003 and 2004. “The first introduced GFS (the Google File System)”, explains Shadi Ibrahim, a member of the Stack team at the Inria Rennes – Bretagne Atlantique centre. “From that point onwards, large files would be divided into small pieces distributed across several servers, with each piece located on a server and replicated roughly three times. In the event of a crash, the piece of data would still be available on another server.” This is doubly advantageous, as “you are able to access the data in parallel, meaning you can read the file quicker. It also speeds up processing.”
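As a rough illustration only (the chunk size, server names and random placement below are assumptions made for the sketch, not GFS's actual placement policy), the idea of splitting a file into replicated pieces can be expressed in a few lines of Python:

```python
import random

CHUNK_SIZE = 64 * 1024 * 1024  # fixed-size pieces, 64 MB here
REPLICAS = 3                   # each piece is kept on three different servers

def place_chunks(file_size, servers):
    """Split a file into fixed-size pieces and assign each piece
    to REPLICAS distinct servers, so one crash never loses data."""
    n_chunks = -(-file_size // CHUNK_SIZE)  # ceiling division
    return {chunk_id: random.sample(servers, REPLICAS)
            for chunk_id in range(n_chunks)}

# Example: a 200 MB file spread over five servers
servers = ["s1", "s2", "s3", "s4", "s5"]
print(place_chunks(200 * 1024 * 1024, servers))
```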

Indeed, processing was the subject of the second article, which dealt with MapReduce: an innovative method for managing big data. “Google had all of these machines used to store data, which were now highly available thanks to multiple servers. They then realised that these machines could be used not only for storage, but for computing purposes as well, thus eliminating the need to transfer data over the network.”
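To give a flavour of the paradigm, here is the canonical word-count example, written as a self-contained Python sketch rather than against Hadoop's actual API: each map task processes the piece of the file stored on its own server, and the reduce step combines the intermediate pairs.

```python
from collections import defaultdict
from itertools import chain

def map_phase(chunk):
    # Map: each server turns its locally stored piece into (key, value) pairs
    return [(word, 1) for word in chunk.split()]

def reduce_phase(pairs):
    # Reduce: pairs sharing the same key are grouped and combined
    counts = defaultdict(int)
    for word, one in pairs:
        counts[word] += one
    return dict(counts)

chunks = ["big data on many servers", "many servers process big data"]
mapped = chain.from_iterable(map_phase(c) for c in chunks)
print(reduce_phase(mapped))  # {'big': 2, 'data': 2, 'on': 1, ...}
```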

The engine driving the cloud

Not long after that, Doug Cutting, a programmer inspired by these two publications, developed an open-source middleware program to implement MapReduce and to make it easier to build applications using this paradigm. Its name: Hadoop. Widely adopted in industry, it quickly became “the engine driving the cloud.”

There was, however, one limit to it: “it was unable to manage applications processing data flows. Instead, it was only able to process data of a certain size already stored within the file system. The issue here is that many applications now generate data flows on a massive scale, which have to be processed as soon as they arrive. Take driverless vehicles, for example. As their sensors collect information, they must be able to make quick decisions. There’s no time to store data: it has to be processed immediately.”

Spark, Storm and Flink

A number of alternative solutions capable of handling these data flows have been developed. “The most widely used among these are Spark, Storm and Flink. The first two originated in the USA, while the third was designed at the Berlin Institute of Technology, but they have all encountered their fair share of problems.”

The KerStream project deals with three of them. It began by looking at the way in which these models consider resources and tasks. “They set out assuming machines to be uniform and all tasks to have the same cost. In reality, however, neither machines nor tasks are homogeneous. You might have different machines in your cluster, for example, or different performance levels on different machines in cases where other applications are sharing the resource. Furthermore, different tasks will have varying degrees of complexity, and workloads will also vary over time. As such, the dynamism of workloads has to be taken into account, something current models aren’t capable of doing.”

Detecting stragglers

The second area of research aims to handle the problem of stragglers, i.e. tasks that take more time to run than others. “In order to detect stragglers, Hadoop looks at the progress rate of the different tasks while they are running, identifying, for example, those progressing 20% slower than the average. Such tasks are then classed as being late, but this technique works poorly.” Why? “Because, by definition, some tasks take more time to complete than others. As a result, you get false detections. Initially, this wasn’t too much of an issue, because Hadoop and MapReduce were used for batches of data that tend to be run over a long period of time. It sometimes takes thousands of seconds to complete a task and whole days for the processing. With data flows, however, we are on another timescale altogether, down to milliseconds or even microseconds. Hadoop’s inherent problem is amplified when the goal is to operate time-critical applications. When we studied current detection methods, we found that they weren’t very good.”
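A minimal sketch of this kind of progress-rate heuristic is shown below; the 20% threshold and the task figures are illustrative assumptions, not Hadoop's exact implementation.

```python
def detect_stragglers(progress_rates, threshold=0.2):
    """Flag tasks whose progress rate is more than `threshold`
    below the average rate of all currently running tasks."""
    avg = sum(progress_rates.values()) / len(progress_rates)
    return [task for task, rate in progress_rates.items()
            if rate < (1 - threshold) * avg]

# progress rate = fraction of the task completed per second (made-up values)
running_tasks = {"t1": 0.010, "t2": 0.011, "t3": 0.006, "t4": 0.009}
print(detect_stragglers(running_tasks))  # ['t3'] -- but is it genuinely late?
```

As the quote points out, a task flagged this way may simply be a longer task, which is exactly the source of false detections.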

This realisation led to the introduction of a set of metrics to help researchers characterise stragglers and build new detection mechanisms. These metrics are: precision, recall, detection latency, the pre-run time of undetected stragglers, and false positives. “The first concerns precision: how many of the tasks classified as running late actually are? You want to be operating at 100%. The second concerns recall: how many stragglers have gone undetected? By using these new metrics, we were able to develop a new detection mechanism. It improved precision, but we continued to miss a number of stragglers. We decided to continue working on the subject.”
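As a purely illustrative example (the task sets below are made up, not results from the project), precision and recall for a straggler detector can be computed as follows:

```python
def precision_recall(detected, actual):
    """detected: tasks flagged as stragglers; actual: tasks that really were."""
    true_positives = detected & actual
    precision = len(true_positives) / len(detected) if detected else 1.0
    recall = len(true_positives) / len(actual) if actual else 1.0
    return precision, recall

detected = {"t3", "t7"}   # flagged by the detection mechanism
actual = {"t3", "t9"}     # genuinely slow tasks
print(precision_recall(detected, actual))  # (0.5, 0.5)
```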

At the same time, they realised that “when Hadoop finds stragglers, it waits for the resource to become available again before making a copy. It doesn’t seek to make the best allocation for this task, despite its criticality. What then happens is that the copy is made on a slow server, which can in turn lead to another straggler.”

Relaunching failed tasks

With the new method, “instead of waiting, once we have detected the straggler and checked to ensure that it has been correctly classified, we look for the resource that will enable us to run the task straight away and more quickly. When it comes to scheduling, we asked ourselves two questions: where and when should this speculative task be carried out?” For the time being, this research is restricted to Hadoop, but the researchers intend to move on to Spark. “In Spark, there are many factors which influence running times. Working initially on Hadoop enabled us to put these factors to one side and to focus on the problem at hand. We will now be able to transpose our results onto applications processing data flows.”
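A toy sketch of that “where and when” decision could look like the following; the node names, speed estimates and greedy choice of the fastest free node are assumptions made for the example, not the scheduler actually built in the project.

```python
def schedule_speculative_copy(straggler, nodes):
    """nodes: mapping node name -> (estimated speed, currently free?).
    Return the fastest free node, on which the copy is launched at once."""
    free_nodes = [(speed, name) for name, (speed, free) in nodes.items() if free]
    if not free_nodes:
        return None  # nothing available yet; the default behaviour would simply wait
    _, best = max(free_nodes)
    return best

nodes = {"n1": (1.0, False), "n2": (0.4, True), "n3": (0.9, True)}
print(schedule_speculative_copy("t3", nodes))  # 'n3'
```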

The third and final subject also concerns failures. “With data flows, it is essential to be able to quickly identify and recover failed tasks. We have only a few microseconds. Spark uses the recomputation approach taken from Hadoop, but this also takes time. In Flink, synchronous checkpoints are used, which adds management overhead, given that the running application has to be stopped while the distributed snapshots are brought together. As such, we are testing something hybrid, with either asynchronous re-execution or synchronous checkpoints, depending on application needs. This would be an adaptive, fully automated solution that would select the best technique for managing failures, depending on the characteristics of the application and the configuration of the system.”
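A deliberately simplified sketch of such an adaptive choice might look like this; the application attributes and thresholds are hypothetical, intended only to illustrate the kind of decision the solution would automate.

```python
def choose_recovery_strategy(app):
    """app: dict with 'state_size_mb' and 'latency_budget_ms' (hypothetical attributes)."""
    small_state = app["state_size_mb"] < 100
    tight_latency = app["latency_budget_ms"] < 10
    if small_state and not tight_latency:
        return "asynchronous re-execution"  # recompute the lost work from its inputs
    return "synchronous checkpointing"      # restore from periodic distributed snapshots

print(choose_recovery_strategy({"state_size_mb": 20, "latency_budget_ms": 500}))
print(choose_recovery_strategy({"state_size_mb": 5000, "latency_budget_ms": 5}))
```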

Launched in 2017, the KerStream project has already met with success. “More and more people are working on it, including in China, Singapore and the USA. Inria and the Lawrence Berkeley National Laboratory also recently put together a joint team to work in this field.” Given that the methods developed are generic, they could be applied to other existing middleware programs for data flows.