Data analysis

Reading between the lines of real estate intelligence

Changed on 09/04/2024
"3 bed flat for sale, Cannes Banana near Suquet, a stone away from the sea." With such listings parading by, one might almost forget to wonder why real estate listings resemble more of a treasure hunt than plain-speaking fact sheets. This odd game has an aim: to locate the property without truly locating it. For both buyers and real estate stakeholders, everything unfolds between these lines and periphrases, but transforming this big blur into actionable real estate intelligence is no simple task. Lucie Cadorel was oblivious to these Semantic Web challenges until SepteoProptech offered her a PhD position in partnership with Inria and i3S. For a young computer scientist with a detective's mind, the call was irresistible : here is her story with its twists and turns, one PhD year at a time.
Lucie Cadorel à Cannes


I did fall into the engineering pot at a very young age, but my heart still swings between research and industry. After my engineering school in Rennes and an internship at Orange in Mougins, I was introduced on LinkedIn to Septeo Proptech, a French real estate unicorn in search of a computer science doctoral student as part of the Cifre program. Real estate being of high interest to many, myself included, I was strongly intrigued by this opportunity to venture into research for the sake of innovation.

The company's mission was clear: to better support real estate agents throughout the sales cycle with reliable evaluation solutions. But because 75% of property listings hide the invaluable GPS coordinates of a property, agents are unable to run a detailed analysis of properties for sale. The challenge intrigued me as much as its potential impact: how to transform both data and the absence of data into valuable smart data for real estate agents?

I was equally excited at the prospect of joining the great Wimmics team led by Fabien Gandon at the Centre Inria d'Université Côte d'Azur, and following in the footsteps of the eminent Rose Dieng-Kuntz, mother of the Semantic Web. But a good story needs its share of twists and turns, so instead of walking the pine forest of Sophia Antipolis, I was to meet my new team on Zoom at the time of the first lockdown. "Welcome to Wimmics and don't forget to turn on your camera." My thesis would also be postponed to the second lockdown. They say timing is everything but I just couldn't believe mine...

The web as a piece of science... and a masterpiece?
Welcome to Wimmics.

© Inria / WIMMICS
Graph representing a query on the Discovery Hub exploratory search engine.

In just a few decades of existence, the web has become a system so complex that it requires a multidisciplinary scientific approach. Under Fabien Gandon's leadership, Wimmics (Web-Instrumented Man-Machine Interactions, Communities, and Semantics) is on a mission to bridge formal and social semantics on the web. This joint research team between Inria and i3S (CNRS and Université Côte d'Azur) represents knowledge with graphs to better support actors, actions and interactions in online communities. 

Wimmics members love graphs so much that they end up finding them beautiful - as evidenced by the above data graph representing a query on Discovery Hub, an exploratory search engine leveraging semantic web and linked data technologies.

Year 1: In search of the mysterious "Cannes Banana"

From October 2020 onwards, the ocean of real estate listings published in the Alpes-Maritimes over the past two years would become my bedside reading. My mission, should I choose to accept it: to grill them until they talk. What do they mean, for instance, by "city center"? In Nice, we know it revolves around Avenue Jean Médecin, but is there more to the name? How big is the famous "golden square"? And what should one make of this duplex described as "near the airport": good news or bad news?

We can't imagine all that language has to unveil until we take the time to question it. My first PhD year was therefore entirely devoted to text analysis. Among my numerous finds, I realized that some neighborhoods exist only in the minds and mouths of real estate agents. The "Banana" quarter in Cannes is a perfect example: no map will ever mention the fruit, yet its impact on prices is quite tasty.

The same phenomenon occurs in Old Nice: administrative boundaries include Cours Saleya, whereas agents tend to move it to a slightly more upscale area. Is this a mere marketing trick on the agents' part or a tell-tale indication of how the neighborhood is seen by its inhabitants? Both, actually, and this is precisely what interests my company: to survey not just the neighborhood itself, but its social representations. How is it talked about, who lives there, what lifestyle prevails? This is where true real estate gold lies: in the "real map" of the neighborhood, drawn by the very words of those who know it best.

Welcome to the Cannes Banana: the peel in yellow and the pricey fruit in brown.
Credit: Inria, i3S, Septeo Proptech

Year 2: From textual analysis to geography

A new challenge arose with my second PhD year. While I swam like a fish in the familiar waters of text analysis, transitioning to the geography phase of the project made me sweat a little more. Fortunately, I could rely on the superpowers of teamwork as a fellow doctoral student in geography soon joined the team. More than kinship, Alicia Blanchi and I nurtured a genuine and invaluable complementarity. We spent the year empowering each other with the geography and computer science skills needed to pursue our joint investigation.

At times we both felt like the Holmes and Watson of real estate intelligence. To assess, for instance, the social representation of the Promenade des Anglais, we gathered all the listings that mentioned the name and retained only those with actual GPS points - about a quarter of them. Using densities and fuzzy boundaries, we were able to establish gradual estimates of the area, based on matching degrees. We then repeated the same process for other neighborhoods.
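For readers who like to see the idea in code, here is a minimal sketch of how geolocated listings mentioning a place name could be turned into a graded, fuzzy footprint. It is an illustration only, not the method used in the thesis: the Gaussian kernel, the 150 m bandwidth, the sample coordinates (local x/y offsets in meters) and the function names are all assumptions made for the example.

```python
import math

def kernel_density(point, samples, bandwidth):
    """Sum of Gaussian bumps centred on each geolocated listing."""
    px, py = point
    return sum(
        math.exp(-((px - sx) ** 2 + (py - sy) ** 2) / (2 * bandwidth ** 2))
        for sx, sy in samples
    )

def make_membership(samples, bandwidth=150.0):
    """Return a function giving a matching degree in [0, 1] for any point."""
    # Normalise by the highest density observed at a listing (assumed choice).
    peak = max(kernel_density(s, samples, bandwidth) for s in samples)

    def membership(point):
        return min(1.0, kernel_density(point, samples, bandwidth) / peak)

    return membership

# Hypothetical listings that mention "Promenade des Anglais" (meters, local frame).
promenade = [(0.0, 0.0), (100.0, 10.0), (200.0, 0.0), (300.0, -10.0)]
degree = make_membership(promenade)

core = degree((150.0, 0.0))        # in the middle of the listings
fringe = degree((150.0, 400.0))    # a few blocks inland
far = degree((5000.0, 5000.0))     # another part of town
```

The matching degree fades smoothly with distance instead of stopping at a hard line, which is the point of a fuzzy boundary.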

Back to the non-geolocated listings, we started putting multiple location indicators into perspective. For instance, a "3-bedroom flat on Promenade des Anglais, near Place Masséna, 5 minutes from the beach" is highly likely to be located at the intersection of these three densities. The exact address remains unknown, but it doesn't matter: we now have a fine-tuned area based on probability averages, and this is exactly what our real estate agents are looking for.

Fine-tuning a property location at the crossroads of three fuzzy estimates extracted from the ad.
Credit: Inria, i3S, Septeo Proptech
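The fusion step can be sketched in a few lines as well. The example below assumes each location clue has already been turned into a matching degree between 0 and 1, and combines the clues with a fuzzy intersection (the minimum of the degrees). The operator, the Gaussian shapes, the centers and the spreads are illustrative assumptions, not values or choices taken from the project.

```python
import math

def bump(center, spread):
    """Hypothetical fuzzy area: matching degree decays with distance from center."""
    cx, cy = center

    def degree(point):
        x, y = point
        return math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2 * spread ** 2))

    return degree

# One fuzzy area per clue found in the listing (made-up coordinates, in meters).
clues = {
    "on Promenade des Anglais": bump((0.0, 0.0), 400.0),
    "near Place Masséna": bump((600.0, 200.0), 300.0),
    "5 minutes from the beach": bump((200.0, -100.0), 500.0),
}

def fused(point):
    """Fuzzy intersection: a point is only as plausible as its weakest clue."""
    return min(degree(point) for degree in clues.values())

# Scan a coarse 100 m grid and keep the most plausible cell.
grid = [
    (float(x), float(y))
    for x in range(-1000, 1001, 100)
    for y in range(-1000, 1001, 100)
]
best = max(grid, key=fused)
```

Taking the minimum is a strict reading of "intersection"; averaging the degrees would be a more forgiving alternative when one clue is unreliable.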


Year 3: The winning triples of Web 3.0

Once the information had been extracted from the text and the location densities mapped out, my third PhD year could begin. The new question that would now keep me awake at night was: who is interested in this knowledge and how can we make it accessible to them?

  • The buyer, of course.
  • The real estate agent, our target audience from the outset, who needs to survey the average prices listed in the neighborhood and then compare similar properties along with their listed and selling prices.
  • The geographer, who also takes a keen interest in the social representation of a given neighborhood.

To make the findings of my research accessible to all these users, I would rely on a knowledge graph, which is the core expertise of Wimmics. In the wonderful world of RDF graphs (Resource Description Framework, the basic language of the Semantic Web), all information can be linked together using triples made of a subject, an object, and a predicate linking the two. For example: "3-bedroom apartment (subject) located in (predicate) Cannes Banana (object)."
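To make the triple idea concrete, here is a minimal sketch that stores a handful of triples as plain Python tuples. It does not use an RDF library, and the identifiers (`listing:42`, `ex:locatedIn`, `ex:CannesBanana` and so on) are hypothetical names invented for the example, not the vocabulary of the actual graph.

```python
from typing import NamedTuple

class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str

# "3-bedroom apartment (subject) located in (predicate) Cannes Banana (object)"
graph = {
    Triple("listing:42", "rdf:type", "ex:Apartment"),
    Triple("listing:42", "ex:bedrooms", "3"),
    Triple("listing:42", "ex:locatedIn", "ex:CannesBanana"),
}

def objects(graph, subject, predicate):
    """Every object linked to `subject` through `predicate`, sorted for stability."""
    return sorted(
        t.obj for t in graph if t.subject == subject and t.predicate == predicate
    )

where = objects(graph, "listing:42", "ex:locatedIn")
```

Because every fact has the same three-part shape, new kinds of information can be added to the graph without redesigning a table schema.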

100 K property listings surveyed

7 M resulting triples in the RDF graph


Once our 100,000 listings had been translated into 7 million triples, our next task was to sort them within an ontology, a formal representation of the graph needed to make this big data accessible to human understanding.

Finally, thanks to the SPARQL endpoint we created, anyone could query the graph. For example, you could ask it to select all listings mentioning a 3-bedroom flat under 500,000 euros located in the Banana and close to the beach - the latter criterion being nowhere to be found on mainstream real estate search engines.
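The logic of such a query can be mimicked in plain Python over a toy set of triples. This is only a sketch of the kind of filtering a SPARQL query expresses, not a query against the real endpoint; the listings, prices and predicate names (`ex:bedrooms`, `ex:priceEuro`, `ex:locatedIn`, `ex:near`) are made up for the example.

```python
from collections import defaultdict

# Toy graph: (subject, predicate, object) with hypothetical identifiers.
triples = [
    ("listing:1", "ex:bedrooms", 3),
    ("listing:1", "ex:priceEuro", 480_000),
    ("listing:1", "ex:locatedIn", "ex:CannesBanana"),
    ("listing:1", "ex:near", "ex:Beach"),
    ("listing:2", "ex:bedrooms", 3),
    ("listing:2", "ex:priceEuro", 650_000),   # too expensive
    ("listing:2", "ex:locatedIn", "ex:CannesBanana"),
    ("listing:2", "ex:near", "ex:Beach"),
    ("listing:3", "ex:bedrooms", 3),
    ("listing:3", "ex:priceEuro", 450_000),
    ("listing:3", "ex:locatedIn", "ex:CannesBanana"),  # no beach mentioned
]

def describe(triples):
    """Group triples by subject: subject -> predicate -> set of objects."""
    props = defaultdict(lambda: defaultdict(set))
    for subject, predicate, obj in triples:
        props[subject][predicate].add(obj)
    return props

def matches(p):
    """3 bedrooms, under 500,000 euros, in the Banana, close to the beach."""
    return (
        3 in p["ex:bedrooms"]
        and "ex:CannesBanana" in p["ex:locatedIn"]
        and "ex:Beach" in p["ex:near"]
        and any(price < 500_000 for price in p["ex:priceEuro"])
    )

results = sorted(s for s, p in describe(triples).items() if matches(p))
```

The "close to the beach" condition is just one more pattern on the graph, which is why a criterion missing from mainstream search forms costs nothing extra here.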

Understanding which adjectives are linked to which neighborhoods is among the semantic studies that are most valuable to both real estate agents and geographers. Our word clouds clearly show that agents are usually short of words when it comes to describing working-class neighborhoods. The more expensive the neighborhood, the more there is to say, especially about the location. Conversely, in a more affordable neighborhood, listings will tend to stick to the description of nearby amenities.

Clouds of adjectives used to describe a neighborhood: when the richness of language speaks volumes about the neighborhood in question.
Credit: Inria, i3S, Septeo Proptech

Year 4: On the engineering road again

#IC2022: Lucie Cadorel wins the Best Highlight Paper award

I am often asked: "Can artificial intelligence write listings on behalf of real estate agents already?" Yes, it can. With a graph like ours, AI could even enhance copy based on the analysis of competing listings. But this specific future won't be mine to write. After earning the Best Highlight Paper award at the Knowledge Engineering Conference in 2022, my research work now lies in the expert hands of Septeo Proptech to eventually become a cutting-edge product of real estate intelligence.

As for me, while I truly enjoyed research over the past three years, I could not resist the call of a new industry venture: I have just joined Continuity as a Machine Learning engineer. As the saying goes, with a slight twist and turn: "you can take the girl out of engineering, but you can't take engineering out of the girl".