MOSTRARE Research team

Modeling Tree Structures, Machine Learning, and Information Extraction

  • Leader: Joachim Niehren
  • Research center(s): CRI Lille - Nord Europe
  • Field: Perception, Cognition, Interaction
  • Theme: Knowledge and Data Representation and Management
  • Partner(s): Université des sciences et technologies de Lille (Lille 1), CNRS, Université Charles de Gaulle (Lille 3)
  • Collaborator(s): Laboratoire d'informatique fondamentale de Lille (LIFL) (UMR8022)

Team presentation

During the last decade, the World Wide Web has evolved into the most important public data store in the world. The Web is huge: it contains billions of public pages and has millions of users. The data formats currently used on the Web are heterogeneous and still evolving. There is pure text in natural language, HTML documents, which add layout structure to pure text, semistructured XML data, where information structure is available, and all kinds of binary data formats.

The Web community is highly interested in adequate information representation, so that information on the Web can be accessed more easily. We therefore expect the number of available XML documents to increase in the near future. One driving force might be the Semantic Web initiative, which aims to add semantic information to semistructured data.

Existing information extraction systems are far too limited to support accurate question answering on the Web. Better algorithms are needed so that machines can learn enough about what the data means. A major challenge in this respect is adaptive information extraction that can exploit the tree structure of Web documents. Tree structure is available in the recent Web formats, HTML and XML, where it encloses the textual information. In this project, we want to integrate tree structures and emerging machine learning techniques into adaptive information extraction systems. In the future, we might also have to account for semantic information.

Research themes

Existing information extraction systems are unsatisfactory. Moreover, the standard document formats of the Web today, HTML and XML, rely on tree structures that wrappers do not take into account. The main objective of the present project is therefore to develop automatic information extraction systems for semistructured data that make use of the underlying tree structure. In the future, we might also have to incorporate semantic information. The goals are the following:
  • Tree structures for information extraction: the definition of adequate models and the design of efficient algorithms for tree structures in the field of information extraction. An example is the definition of tree wrappers.
  • Tree structures in machine learning: the development of machine learning algorithms for the induction of tree wrappers, and their application to information retrieval and text classification; the combination of learning algorithms to define wrappers over diverse data sources and heterogeneous data.
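
To make the notion of a tree wrapper more concrete, here is a minimal sketch in Python. It is an illustration only and not the team's own formalism: the sample document, the function name `tree_wrapper`, and the choice of a path expression as the wrapper representation are all assumptions made for this example. The wrapper selects nodes by their position in the document tree, so a text node that merely carries the same attribute elsewhere in the page is not extracted.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample page, invented for this illustration.
DOC = """
<html><body>
  <table>
    <tr><td class="name">Alice</td><td class="price">10</td></tr>
    <tr><td class="name">Bob</td><td class="price">20</td></tr>
  </table>
  <p class="name">not a table cell</p>
</body></html>
"""


def tree_wrapper(root, path):
    """Return the text of every node selected by `path`.

    `path` is an ElementTree path expression; using it as the wrapper
    is an assumption of this sketch, not a definition from the source.
    """
    return [node.text for node in root.findall(path)]


ROOT = ET.fromstring(DOC.strip())

# Select only the name cells that sit inside a table row.
NAMES = tree_wrapper(ROOT, ".//tr/td[@class='name']")
print(NAMES)
```

In this sketch the path expression is written by hand. In the setting described above, a learning algorithm would instead induce such a selection from annotated example documents.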

International and industrial relations

  • Industrial partners: Innovimax, SAP Research, XRCE - Xerox Grenoble
  • International scientific relations: NICTA Sydney, Oxford, Barcelona, Girona, Frankfurt.

Keywords: Machine learning; tree automata; logic; constraints