Rencontres Inria Industrie
Exploration of Web Archives
We introduce a new approach to exploring web archives, based on structured units of information called web fragments, which are extracted from archived HTML pages. A web fragment is a part of a web page, such as a news article or a blog post. Its distinguishing feature is that it is indexed by its edition date (the time when the fragment was written) rather than by its archive date (the time when its parent web page was crawled).
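To make the distinction between the two dates concrete, here is a minimal sketch of a fragment record and an index ordered by edition date. The names `WebFragment`, `FragmentIndex`, and `between`, as well as the sample fragments, are hypothetical and chosen for illustration only; they are not taken from the system described here.

```python
from bisect import bisect_left, bisect_right
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class WebFragment:
    """Hypothetical record for one fragment of an archived page."""
    text: str
    edition_date: date   # when the fragment was written
    archive_date: date   # when the parent page was crawled


class FragmentIndex:
    """Illustrative index that orders fragments by edition date."""

    def __init__(self, fragments):
        self._fragments = sorted(fragments, key=lambda f: f.edition_date)
        self._keys = [f.edition_date for f in self._fragments]

    def between(self, start, end):
        """Return fragments whose edition date lies in [start, end]."""
        lo = bisect_left(self._keys, start)
        hi = bisect_right(self._keys, end)
        return self._fragments[lo:hi]


# Made-up sample: three fragments captured by the same crawl in June 2011.
fragments = [
    WebFragment("Post A", date(2011, 2, 20), date(2011, 6, 1)),
    WebFragment("Post B", date(2011, 5, 30), date(2011, 6, 1)),
    WebFragment("Post C", date(2010, 12, 1), date(2011, 6, 1)),
]
index = FragmentIndex(fragments)
february = index.between(date(2011, 2, 1), date(2011, 2, 28))
```

All three sample fragments share one archive date, so a query on the crawl date could not tell them apart. Querying on the edition date isolates the single fragment written in February 2011.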
We have built a search engine on top of these web fragments. By focusing on the edition date, the search engine can retrieve a specific set of archives with greater temporal accuracy. The application can also detect and identify threshold-based events related to a given query. For the demonstration, we use a corpus of web archives covering the online activities of the Moroccan diaspora.
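The text does not specify how the threshold is defined, so the following is only one possible reading: count the fragments matching a query per edition date and flag the dates whose count reaches a fixed threshold. The function `detect_events`, the substring matching, the fixed threshold, and the sample data are all assumptions made for this sketch.

```python
from collections import Counter
from datetime import date


def detect_events(fragments, query, threshold):
    """Return edition dates with at least `threshold` matching fragments.

    `fragments` is an iterable of (edition_date, text) pairs. Matching is a
    case-insensitive substring test, which is an assumption of this sketch.
    """
    needle = query.lower()
    counts = Counter(day for day, text in fragments if needle in text.lower())
    return sorted(day for day, n in counts.items() if n >= threshold)


# Made-up sample data, for illustration only.
sample = [
    (date(2011, 2, 20), "Protest in Rabat"),
    (date(2011, 2, 20), "Report on the protest"),
    (date(2011, 2, 20), "Diaspora reacts to protest"),
    (date(2011, 2, 21), "Protest aftermath"),
    (date(2011, 2, 21), "Football results"),
]
events = detect_events(sample, "protest", threshold=3)
```

With a threshold of 3, only 20 February 2011 is flagged, since three fragments written that day match the query. A real system might instead compare each day against a moving average, but that choice is not stated in the text.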