Sophie Timsit - 6/06/2018

LEGO: teaching natural language to machines

Teaching machines to analyze language and words in a large volume of data in order to extract information and perform tasks... This is the objective of the LEGO associate team’s research: “LEarning GOod representations for automatic natural language processing.” Launched in January 2016 for a period of three years, LEGO brings together the Magnet project team* and the University of Southern California. The partnership will soon come to an end, but it promises further advances.

Computers needed to cope with the explosion of text data

“Our project is part of the era of artificial intelligence and big data,” explains Aurélien Bellet, researcher with the Magnet project team. In January 2016, the Lille team launched LEGO, a team associated with researchers from the University of Southern California. LEGO is interested in information networks available on the Internet and in particular in large volumes of data with a textual dimension, such as Wikipedia, social networks and blogs. To analyze such large amounts of data, it has become much more efficient to teach machines to process natural language than to rely on humans. The idea is to use artificial intelligence techniques to generate internal representations of texts and words so that computers can automatically interpret data and extract the information needed to perform calculations and tasks. However, as the researcher points out, “The challenge is significant because natural language is not a formal language like a programming language. There are contextual elements, double meanings, ambiguities, implicit cultural references and other details that are difficult for a machine to take into account.”

Pushing the limits and performance of machines

“We train machines to generate word representations in space so they can make semantic connections between words, study sentence syntax and thus understand the meaning of texts and even associated emotions and feelings.” This fundamental research can be applied to multiple concrete problems, such as searching for and recommending texts (press articles, for example), or preventing social unrest and suicide through the automatic analysis of posts on social networks. More generally, it enables computers to extract and compare knowledge from texts and make connections between data and the texts that are analyzed. That’s not all. “We’re also working on automatic learning of textual representations with a visual dimension. We push the machines to bring the representations of words closer together according to their frequency of co-occurrence in large text corpora, as well as their visual similarities.” For this, in addition to the text corpus, the machines are trained on image banks such as ImageNet. “The objective is to enrich the conceptual representation of the text with a visual dimension. This also improves natural language image search tools.”
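To make the idea of co-occurrence-based word representations concrete, here is a minimal toy sketch in Python. It is not the LEGO team's method: the function names, the five-sentence corpus and the window size are all assumptions chosen for illustration, and the visual dimension described above is left out entirely. Each word is represented by the counts of the words that appear near it, and two words are compared with cosine similarity.

```python
from collections import Counter
from math import sqrt


def cooccurrence_vectors(sentences, window=2):
    """Map each word to a Counter of the words seen within `window` positions of it.

    Hypothetical helper for illustration; real systems learn dense vectors
    from far larger corpora.
    """
    vectors = {}
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            # Neighbours on both sides, excluding the word itself.
            context = tokens[lo:i] + tokens[i + 1:hi]
            vectors.setdefault(word, Counter()).update(context)
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm_u = sqrt(sum(x * x for x in u.values()))
    norm_v = sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy corpus (an assumption, not data from the project).
corpus = [
    "the cat drinks milk",
    "the dog drinks water",
    "the cat chases the mouse",
    "the dog chases the ball",
    "the car needs fuel",
]

vectors = cooccurrence_vectors(corpus)
sim_cat_dog = cosine(vectors["cat"], vectors["dog"])
sim_cat_fuel = cosine(vectors["cat"], vectors["fuel"])

print(f"cat vs dog:  {sim_cat_dog:.3f}")
print(f"cat vs fuel: {sim_cat_fuel:.3f}")
```

In this toy corpus, "cat" and "dog" share most of their neighbouring words, so their vectors end up closer to each other than "cat" and "fuel", which share none. Learned representations trained on large corpora follow the same intuition at a much larger scale.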

A long-standing partnership on machine learning

The story begins with Bellet's post-doctoral work, completed in 2013-14 with Professor Fei Sha's team at the University of Southern California before he was hired by Inria. “The creation of LEGO has allowed us to perpetuate, formalize and strengthen this partnership, as well as to involve the skills of the Magnet team that I joined after completing my post-doctorate work.” Both teams work on fundamental aspects of machine learning, a major branch of artificial intelligence. To carry out the LEGO project, the Magnet team brings its skills in linguistics and language structuring, while Professor Fei Sha's team contributes its experience in deep learning and image processing. Their collaboration on the textual and visual aspects of natural language representation emerged with the creation of the associate team.

Renewal to be requested for the associate team

“The results of the LEGO project are very positive,” confirms Bellet. So much so that Mélissa Ailem, a Ph.D. graduate from the University of Paris-Descartes, was able to obtain an Inria@SiliconValley post-doctoral fellowship to focus on the problems addressed by the associate team. “Mélissa offers very specific expertise on probabilistic models and makes the link between our two laboratories...” The associate team will come to an end at the end of the year, but the researchers plan to request its renewal, particularly to explore the automatic transfer of representations from one language to another. “This line of research is very important, because for some little-used languages, there is no body of texts large enough to serve as a reference.” The researchers would also like to study the evolution of natural language over time. “The meaning of words evolves, new words appear, and it is essential to take these changes into account for dynamic representations of language.”

Inria associate teams

An “associate team” is a joint research project by an Inria project team and a research team based abroad. The two partners jointly define a scientific objective, research plan and bilateral exchange program. To promote and develop such partnerships by supporting top-notch research projects, Inria's European and International Partnerships Directorate (DPEI) issues an annual call for projects.

Inria@SiliconValley

Inria@SiliconValley structures and strengthens research and innovation efforts between Inria and its partners in California on joint projects with transatlantic impact. It is based on Inria's various research and international travel programs, which include support for some thirty joint research projects since 2011 (associate teams), sabbatical stays by Inria researchers, the Inria International Chair and the hiring of post-doctoral fellows.

*The joint Magnet project team includes researchers from CNRS, University of Lille - Human and Social Sciences and University of Lille - Sciences et Technologies. It is part of joint research unit UMR 9189 - CNRS-Centrale Lille-University of Lille1 − Sciences and Technologies, CRIStAL.

Keywords: Associate teams Machine learning Natural-language processing
