Serge Abiteboul: Webdam, improved web data management

Changed on 25/03/2020

The database theory on which data management systems are based is not appropriate for the web. Within his Webdam project, funded by the European Research Council (ERC) since December 2008, Serge Abiteboul proposes developing suitable mathematical foundations, with better quality of service and improved performance as the outcome.

"It was while teaching distributed data management at Stanford University between 1995 and 1997 that I realised how the theory ceases to apply when distribution starts," Serge Abiteboul relates. "The field at the time consisted of a bunch of recipes, a set of techniques that were sometimes not even consistent." Nothing remarkable about that. It had all happened very fast. The web experienced incredible growth in about fifteen years, with billions of pages online, accessible to billions of users across millions of independent servers communicating over a network. Querying, exchanging, sharing and updating this ocean of data means managing a set of complex and flexible interactions between disparate machines running a variety of operating systems and languages. All the more so since, for the last few years, peer-to-peer (P2P) has had to be included, with user communities pooling their machines to manage data. Theory is struggling to keep up. The web is nonetheless based on a few quality standards, such as XML for data interchange, RDF to describe data types and semantics, and web services to enable non-uniform machines to communicate. But data management is lagging behind: for instance, it is near-impossible to truly delete published data. It is also very complicated to check whether data is approved or whether access rights are properly observed. "Within Webdam, we are going to try to create a unified mathematical view of this distributed universe," explains Serge Abiteboul.

The ERC grant of €2.4 million over five years is a unique opportunity to meet this ambitious challenge with French researchers, while also attracting brilliant scientists from Europe, the US or elsewhere.

The aim is to describe web data management applications more formally, making automated reasoning possible. To achieve this, Serge Abiteboul is taking inspiration from relational database management systems, which are able to process huge quantities of data on a centralised server. Developed from the 1970s onwards, they have become the norm for data storage. They ensure strict conditions of usage and verification. They are built on a thorough mathematical foundation, with a simple logic allowing data to be described and used. Their basic structure, the two-dimensional table, nonetheless bears little relation to the tree and graph structures of the web, where pages interact in complex ways. To say nothing of the fact that, on the web, servers are independent and not uniform at all. The challenge is huge.

Managing for greater responsiveness

Who will benefit? "On the face of it, the average user will not notice, in everyday use, the scale of this guaranteed, safe management," replies Serge Abiteboul. "The benefit is mainly to programmers who will be able to develop applications more easily and be more responsive, which is a benefit in many sectors, such as finance." Improving web performance is moreover expected to support the continued growth in data volumes, especially with the development of peer-to-peer networks. For this database theoretician, it will also make it possible to teach this new technology with the rigour demanded of other disciplines, such as mathematics.

"We will show the benefit of the Semantic Web in real applications."

Patrick Giroux, an architect at EADS DS, is in charge of the architecture for the multimedia document retrieval platform developed under the ANR WebContent project, initiated by Serge Abiteboul.

WebContent is a software platform which should facilitate the development of applications using web content. The 19 industry and academic partners are developing, on EADS' WebLab architecture, components that will make it possible to collect, store, modify and, in a word, use multimedia content from the web consistently, flexibly and generically. One of the first four pilot applications is commercial and technological intelligence for Airbus. WebContent is primarily based on the semantic web, in other words, the exchange of knowledge rather than mere data. All the software components are developed within this formal framework, to which Inria has made a large contribution. The GEMO team provides a peer-to-peer (P2P) storage service for all types of data (documents, concepts, relationships, etc.).