Mining Data, Finding Nuggets
While the digital economy churns out ever more dizzying volumes of information, the smart exploitation of this data bonanza paradoxically lags far behind. Data mining faces a glaring lack of adequate tools: more often than not, what is also referred to as Knowledge Discovery in Data (KDD) remains a predominantly manual process. At the Inria research center in Rennes, Brittany, France, a group of data scientists led by Alexandre Termier is mulling over novel approaches meant not only to automate the exploration workflow, but also to improve the visualization of results and to put the user in the loop.
“Our job? Inventing tools that help people explore their data and discover information they didn't even suspect was in it,” sums up Alexandre Termier, head of the upcoming Lacodam research team. “Hand us all the receipts issued by a supermarket over a period of, say, two years. By plowing through the wad, we might well find out that the customers who purchase plain yogurt are also those who buy leeks, golden apples or whatnot. Knowledge discovery in data is on a roll these days, all the more so with the advent of the famous big data.”
Indeed, “companies now generate huge amounts of information that deserve to be analyzed... provided one can effectively manage to discover new and genuinely useful information. For if exploration only turns up self-evident trivia, it will be of little use to industry. On top of that, mining doesn't come easy in and of itself. There is no magic button yet that users could push to get their data efficiently explored. Hopefully, though, we will be able to supply one in the coming years.”
One of the major stumbling blocks is the sheer mass of data to be mined. “Let's return to our supermarket example. If I search for all the occurrences of product combinations in the receipts, the search will return millions of results. Finding those patterns therefore calls for a lot of computation and lengthy runtimes.” In recent years, the scientists have come up with ParaMiner. Leveraging the power of multi-core architectures, this parallel pattern mining algorithm has achieved significant reductions in execution time. “Two orders of magnitude in some cases.”
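The combinatorial blowup described here is easy to see in a naive frequent itemset miner. Below is a minimal sketch, assuming toy data: the receipts, the `min_support` threshold and the product names are invented for illustration and do not come from the article.

```python
from itertools import combinations
from collections import Counter

# Toy receipts: each one is the set of products on a single till ticket.
receipts = [
    {"plain yogurt", "leeks", "golden apples"},
    {"plain yogurt", "leeks", "bread"},
    {"plain yogurt", "golden apples"},
    {"bread", "milk"},
]

def frequent_itemsets(receipts, min_support=2, max_size=3):
    """Naive enumeration: count every product combination in every receipt.

    The candidate space grows combinatorially with the number of distinct
    products, which is why real miners (Apriori, FP-growth, ParaMiner...)
    prune it aggressively instead of enumerating it like this.
    """
    counts = Counter()
    for receipt in receipts:
        for size in range(1, max_size + 1):
            for combo in combinations(sorted(receipt), size):
                counts[combo] += 1
    return {combo: n for combo, n in counts.items() if n >= min_support}

print(frequent_itemsets(receipts))
```

Even on four receipts, the miner surfaces the kind of co-purchase pattern Termier mentions: `("golden apples", "plain yogurt")` appears twice. On real data, the candidate space is what parallel miners like ParaMiner are built to tame.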
Despite these inroads, coping with the deluge remains a very sobering challenge. “We work, for instance, with Genscale, a bioinformatics research team at the Inria center in Rennes. The genomic data generated by DNA sequencing machines is so complex that it defeats all our algorithms, including some I was pretty proud of until now. None is up to the task.” How come? “The complexity of our algorithms is exponential in the number of columns in the matrices to be processed. Add one single column and the complexity doubles. Sure enough, by the 20,000th column, the tool is out of breath. Then it capitulates. Yet in bioinformatics, columns can number... in the millions!”
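The doubling behavior quoted above follows from the size of the pattern search space: over n binary columns there are 2^n − 1 non-empty column subsets that could, in principle, be candidate patterns. A quick illustration (the function name is ours, not from any Lacodam tool):

```python
# Candidate pattern space over n binary columns: all non-empty column
# subsets. Adding one column roughly doubles the space, matching the
# "add one column and the complexity doubles" behavior quoted above.
def candidate_space(n_columns: int) -> int:
    return 2 ** n_columns - 1

for n in (10, 20, 30):
    print(n, candidate_space(n))

# At the 20,000-column scale mentioned in the article, the count is
# astronomical: candidate_space(20_000) has more than 6,000 decimal digits.
```

With millions of columns, as in genomics, no amount of pruning or parallelism rescues a search whose worst case scales like this; hence the team's algorithms capitulate.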
Another daunting hurdle concerns genericity. “More often than not, data mining algorithms are the product of specific optimizations that depend on what they are meant for. They suit some tasks better than others. They perform more or less efficiently according to the nature of the data. It is therefore up to the user to pick the most suitable tool.” And that's where things get decidedly tricky. “Before being able to reuse an algorithm once designed by someone else, the user must adapt this specialized tool to the task they wish to perform on their own dataset. And this customization work is unbearably arduous.”
Easier Access to Algo-Diversity
Consequently, in the real world, users exploit but a fraction of the array of tools available to them. “We have a magnificent zoo of algorithms. But nobody comes to visit it. People stop at the first cage.” Namely? “Itemset frequency algorithms. Those are meant to find repetitions in the data. Everybody knows them. Everybody uses them. They even come integrated into major vendors' software. The rest of the zoo, on the other hand, remains completely ignored, which I deeply regret, for there exist much richer tools that allow far more expressive operations!” The scientists are contemplating software that would give the end user easier access to such algo-diversity. This unifying framework would capture all the components of the workflow and allow the user to compose them, making it possible to pick the operator best suited to a particular task on a given dataset. In other words: “One algorithm to rule them all.”
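The composability idea can be sketched in a few lines. This is purely illustrative of the general principle, assuming nothing about Lacodam's actual design: each mining step is a plain function on data, a workflow is their composition, and swapping one operator for another is a one-line change.

```python
from functools import reduce

def compose(*steps):
    """Chain mining operators left to right into one workflow."""
    return lambda data: reduce(lambda d, step: step(d), steps, data)

# Hypothetical operators, for illustration only.
clean = lambda rows: [r for r in rows if r]          # drop empty rows
dedupe = lambda rows: list(dict.fromkeys(rows))      # keep first occurrences
mine = lambda rows: {r: rows.count(r) for r in rows} # trivial frequency "miner"

workflow = compose(clean, mine)
print(workflow(["a", "", "b", "a"]))
# → {'a': 2, 'b': 1}
```

Picking a different tool from the zoo then amounts to substituting one operator in the chain, e.g. `compose(clean, dedupe)`, rather than rewriting a monolithic pipeline.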
Also of concern is the fact that “algorithms are not selective enough, because the user can't manage to supply an accurate definition of what the tool is supposed to look for.” So how to deal with this issue? “One option consists in helping the user formulate a more discriminating definition. There exist somewhat mechanical methods for trying to highlight the patterns that are supposed to be the most interesting to show. Along this line of research, we have a partnership with chip maker STMicroelectronics. It aims at helping processor developers find bugs. For that purpose, one must explore the processor's execution traces.” Manually finding the relevant information buried deep inside such a heap of arcane data is anything but a day at the beach. “We are testing optimization techniques with the goal of finding a set of, say, ten patterns that would enable us to rewrite the trace so that the developer can visualize it better and thus save considerable time.”
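The trace-rewriting idea can be illustrated with a toy sketch. This is not STMicroelectronics' or Lacodam's actual technique, and the event names and patterns below are invented: once a small set of patterns has been found, occurrences in the trace are collapsed to single symbols, so the unexplained events stand out.

```python
def rewrite_trace(trace, patterns):
    """Greedy left-to-right rewriting with a fixed pattern dictionary."""
    out, i = [], 0
    while i < len(trace):
        for name, pat in patterns.items():
            if trace[i:i + len(pat)] == pat:
                out.append(name)       # whole pattern collapsed to one token
                i += len(pat)
                break
        else:
            out.append(trace[i])       # event not covered by any pattern
            i += 1
    return out

trace = ["fetch", "decode", "exec", "fetch", "decode", "exec", "irq", "fetch"]
patterns = {"CYCLE": ["fetch", "decode", "exec"]}
print(rewrite_trace(trace, patterns))
# → ['CYCLE', 'CYCLE', 'irq', 'fetch']
```

In the rewritten trace, the interrupting `irq` event is immediately visible between two ordinary cycles, which is the kind of time-saving visualization the quote describes.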
A Massively Cooperative Online Hub
Beyond the multi-faceted effort toward automating the data-crunching process at different stages of the workflow, the Lacodam team also plans to introduce a user-in-the-loop approach, which would be a novelty in the field. “Most of the time, people are required to put in an awful lot of work before, at last, being able to start making sense of their data. But this user input is not leveraged the way it should be. Basically, it gets lost. People can't benefit from each other's experience. We would like to make the user a first-class citizen and tap into their knowledge of a domain. We envision a massively cooperative online hub. Data analysts from all walks of life would log on, inject domain knowledge, improve datasets, bookmark the best workflows, and so on and so forth.” This community momentum would decrease the amount of work required of each individual. “The combination of the input knowledge and the feedback on mining workflows would greatly enrich our system.”
Lacodam is a research team of Inria, Insa Rennes and Rennes 1 University, within Irisa (UMR 6074). It succeeds the former Dream project-team.