
"Those who ignore statistics have
to reinvent it." This quip, freely adapted from Bradley Efron,
one of the leading statisticians of the late twentieth century,
illustrates the importance that statistics has acquired in modern science.
Increasingly, research scientists have to study complex processes that
involve random phenomena, that is, phenomena producing data that vary by
chance, in order to explain and act on the world. We try, for instance, to
model and predict the weather, identify genes, interpret images, and so on.
Statistics in the broader sense covers the whole spectrum of techniques
used to study random phenomena. These techniques call upon probability
theory to design probability distributions capable of modeling such
phenomena. Statistics in the strict sense is the experimental side of
probability theory: it analyzes and interprets the data supplied by random
phenomena. Lastly, statistical inference analyzes experimental or
observational data in order to assess how well they fit probabilistic models.
Using these techniques, scientists build theoretical models that are
then confronted with data so that they can be fitted as closely as
possible. Depending on the field, this phase of the modeling work is called
learning, parameter estimation or identification, model approximation, or
data assimilation. Statistics makes it possible to take into account not
only the noise present in the data, but also the heterogeneity of the models
studied and the atypical values (outliers) that may occur. In the following articles,
we will see that the complexity of the situations to be modeled has led
to the development of sophisticated statistical methods that go beyond
the traditional framework of homogeneous, independent data sharing the
same distribution. For example, hidden structure models (especially Markov chains
and fields) play a major role in signal processing and image analysis.
Another essential aspect of the use of statistics is that it makes
it possible to assess the realism or efficiency of a model by supplying
an evaluation of its stability and precision. Statistical inference
plays a central role in this regard by allowing the construction of
model evaluation measures that simultaneously take into account sampling
fluctuations in data acquisition and the model's ability to
represent reality. Based on measures of how well data fit probabilistic
models, inference provides a sound theoretical framework for evaluating
the quality of the results produced and thus for selecting a meaningful
model or a high-performance analysis method. Statistical tests, resampling
techniques such as cross-validation or the bootstrap (both widely
used in statistical learning), and the Bayesian approach, which leads
to model selection criteria that balance a model's
goodness of fit against a measure of its complexity, are all powerful
statistical inference tools for accomplishing this difficult task. In
the field of data mining or statistical learning, for example, statistics
plays an important role in the validation of methods that were designed
in a non-random context, such as function approximation (neural networks,
support vector machines (SVM)), or in a purely geometric framework.
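As an illustration of the resampling techniques mentioned above, here is a minimal sketch of the bootstrap used to gauge the precision of an estimate. It is not taken from the articles in this issue: the function name `bootstrap_standard_error`, the simulated Gaussian data, and the number of resamples are all assumptions chosen for the example. The idea is to redraw the observed sample with replacement many times and to measure how much the statistic of interest fluctuates across these redraws.

```python
import random
import statistics


def bootstrap_standard_error(sample, statistic=statistics.mean,
                             n_resamples=2000, seed=0):
    """Estimate the standard error of `statistic` by bootstrap resampling.

    Hypothetical helper written for this illustration only.
    """
    rng = random.Random(seed)
    n = len(sample)
    replicates = []
    for _ in range(n_resamples):
        # Draw n observations from the sample, with replacement.
        resample = [sample[rng.randrange(n)] for _ in range(n)]
        replicates.append(statistic(resample))
    # The spread of the replicates approximates the sampling fluctuation.
    return statistics.stdev(replicates)


# Simulated data (an assumption for the example): 200 noisy measurements.
data_rng = random.Random(42)
data = [data_rng.gauss(10.0, 2.0) for _ in range(200)]

se_boot = bootstrap_standard_error(data)
# Classical formula for the standard error of the mean, for comparison.
se_formula = statistics.stdev(data) / len(data) ** 0.5

print(f"bootstrap standard error: {se_boot:.3f}")
print(f"formula standard error:   {se_formula:.3f}")
```

For the sample mean the two values should be close, which serves as a sanity check. The interest of the bootstrap is that the same function can be called with another statistic, for example `statistics.median`, for which no simple formula is available.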
Contact:
Gilles Celeux,
SELECT team, INRIA Futurs,
Tel.: +33 1 69 15 57 77