
Igor Pruenster Keeps a Straight Course in the Sea of Machine Learning and Data Science

by Claudio Todesco
The Director of the Bocconi Institute for Data Science and Analytics defends research rigor even when businesses value speed over accuracy in their data analysis

"These are great times for Statistics", Igor Pruenster says. He is Full Professor at the Department of Decision Sciences and Director of the Bocconi Institute for Data Science and Analytics (BIDSA). "The new field of data science, which has Statistics at its core, is pivotal to many success stories in a variety of application areas". It is therefore no wonder that an influential journal such as Operations Research has established a new area dedicated to "Machine Learning and Data Science", and Igor has been appointed as Associate Editor, a position he already holds also for several other journals including The Annals of Statistics, the premier mathematical statistics journal. "These days Data Science is hugely popular. Businesses want quick answers to their needs and this can sometimes happen at the expense of accuracy. It is therefore essential to preserve a rigorous and principled approach to research. This is the spirit of BIDSA and of the Master of Science in Data Science and Business Analytics that will be launched in the fall".

Complex models for a complex world
The problem of complexity is ubiquitous in modern science. As the phenomena under investigation become increasingly complex, more elaborate models are required to describe them on the one hand, and on the other the data exhibit richer and more sophisticated structures. This double challenge often requires a shift from a parametric to a non-parametric approach. The latter makes it possible to flexibly estimate functional objects such as the distributions of topics and words in text collections or risk curves in medical contexts. Sometimes, though, the models are "black boxes" whose functioning is obscure. The gap between what can be implemented computationally and the theoretical knowledge of the models' properties is widening. Igor Pruenster's work aims to bridge this gap by trying to discover the deep structures of these models. How do they work? "Sometimes it turns out that they don't work. A clear understanding of the theoretical properties of models and algorithms cannot be eluded. Of course, mistakes in medical contexts have a completely different impact compared to errors in the tech industry, such as displaying a wrong advertisement".

Populations and inference
Igor Pruenster's early work focused on developing rigorous prediction schemes that are flexible enough to describe the characteristics of a specific population. For instance, the genetic diversity of a DNA library can be investigated by estimating the rate at which new genes would be discovered through additional samples. "Traditional modeling implicitly assumed a logarithmic discovery rate", says Igor, who also addressed this topic in the ERC-funded project New Directions in Bayesian Nonparametrics. "Richer and more flexible models have enabled us to describe essentially every possible growth rate, making the methodology well suited in many fields, not only genomics". In subsequent papers, Igor studied the relationship between different populations which, although distinct, produce similar data: in our example, DNA libraries from different parts of the same organism. "We flexibly model how populations depend on each other and by doing this we increase the power of prediction and estimation".
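
The contrast between a logarithmic discovery rate and a faster one can be made concrete with a minimal Python sketch. It rests on assumptions the article does not state: it takes the Dirichlet process as the "traditional" model and the Pitman-Yor process (named in the reading list at the end) as one example of a more flexible model. The function name expected_distinct and the parameter names theta and sigma are illustrative choices, not taken from the papers.

```python
import math


def expected_distinct(n, theta=1.0, sigma=0.0):
    """Expected number of distinct species among n samples.

    Assumed model (not stated in the article): the sequential sampling
    scheme of a Pitman-Yor process with concentration theta > 0 and
    discount 0 <= sigma < 1. With sigma = 0 it reduces to the
    Dirichlet process.
    """
    if theta <= 0 or not 0 <= sigma < 1:
        raise ValueError("need theta > 0 and 0 <= sigma < 1")
    k = 0.0  # expected number of distinct species seen so far
    for i in range(n):
        # After i samples containing k distinct species, the next sample
        # is a new species with probability (theta + sigma*k)/(theta + i).
        # This rule is linear in k, so the recursion on the mean is exact.
        k += (theta + sigma * k) / (theta + i)
    return k


if __name__ == "__main__":
    for n in (100, 10_000, 1_000_000):
        dp = expected_distinct(n, theta=1.0, sigma=0.0)
        py = expected_distinct(n, theta=1.0, sigma=0.5)
        print(f"n={n:>9}  sigma=0: {dp:8.1f}  "
              f"(log n = {math.log(n):.1f})  sigma=0.5: {py:10.1f}")
```

Under these assumptions, sigma = 0 gives an expected count that grows like theta times log n, while any sigma between 0 and 1 gives power-law growth of order n to the power sigma. Varying sigma is one simple way to see how a single extra parameter opens up a range of discovery rates.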

The future is unwritten
There is still much to be done: think of the ability to run algorithms described by complex models on massive data sets. "The industry calls for ready-to-use solutions. Academic research must preserve its medium- to long-term horizon and develop the best possible models, even if they will be implemented in ten years' time. It is also true that computational power is growing at a dizzying rate: the algorithms that today give results in a week will, in the future, perform the same task in a few seconds".

Another key issue for the future is the reproducibility of research results. The idea that the computer code must be included in a paper with a significant computational component, just as you would include the proof of a theorem, has only recently begun to spread in the scientific community. "Overall, this is an exciting time to do research in this field. And it is precisely in times like these that we must not stray from rigorous research principles".

Find out more
Antonio Lijoi, Ramsés H. Mena, Igor Pruenster, Bayesian nonparametric estimation of the probability of discovering a new species, in Biometrika, 94, 769-786, 2007.

Antonio Lijoi, Bernardo Nipoti, Igor Pruenster, Bayesian inference with dependent normalized completely random measures, in Bernoulli, 20, 1260-1291, 2014.

Antonio Canale, Antonio Lijoi, Bernardo Nipoti, Igor Pruenster, On the Pitman-Yor process with spike and slab base measure, in Biometrika, 104, 681-697, 2017.

Federico Camerlenghi, Antonio Lijoi, Peter Orbanz, Igor Pruenster, Distribution theory for hierarchical processes, in The Annals of Statistics, forthcoming 2018.