Quantifying uncertainty
A classic success story in machine learning (ML) and artificial intelligence (AI) over the last decade is the extraordinary accuracy achieved by large ML models on prediction and classification tasks, such as object recognition in images. At the same time, many recent advances and open problems in AI require a deeper integration of ML with probabilistic thinking, including the need to build probabilistic representations of data. One obvious example is quantifying the uncertainty surrounding a prediction or classification. Consider, for example, merely providing a doctor with an AI-based diagnosis of a disease versus also providing an assessment of our confidence in that diagnosis.
A popular example of probabilistic modelling in AI is offered by so-called generative models, which have witnessed major advances in recent years. Instances include models for the generation of images (including controversial so-called deepfakes), text (from automatically creating captions for images to advanced chatbots), and even art (such as music generation software). The fundamental idea is to build probabilistic models that can generate data that "look like" real data, e.g., photos of people who have never existed but could have existed. They differ from "discriminative" models, which learn, for example, to predict which object is shown in a picture (despite poor lighting, uncommon posture, etc.) but cannot generate new images in which such an object is shown. More generally, probabilistic modelling is not only about generating new "fake" data. It is about learning generative mechanisms, i.e. building models that quantify and potentially reproduce the randomness inherent in data. Such probabilistic representations can help perform various ML-related tasks: identifying "unlikely" observations that might need additional information before a reliable decision can be made; quantifying the uncertainty about a prediction or decision produced by an ML model; detecting outliers and suspicious behaviour; and exploiting the inferred generative model to make inferences about latent structures in the data.
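As a minimal toy illustration of this dual use of a generative model (not taken from the text above), consider fitting a one-dimensional Gaussian to data by maximum likelihood: the same fitted model can score how "unlikely" a new observation is via its log-density, and generate fresh synthetic data by sampling. The function names below are purely illustrative.

```python
import math
import random

def fit_gaussian(data):
    """Maximum-likelihood estimates of the mean and variance of a 1-D Gaussian."""
    n = len(data)
    mean = sum(data) / n
    var = sum((x - mean) ** 2 for x in data) / n
    return mean, var

def log_density(x, mean, var):
    """Log-density of x under the fitted Gaussian: the model's 'plausibility score' for x."""
    return -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)

def sample(mean, var, rng):
    """Generate a new synthetic observation from the fitted generative model."""
    return rng.gauss(mean, math.sqrt(var))

rng = random.Random(0)
data = [rng.gauss(5.0, 1.0) for _ in range(1000)]
mean, var = fit_gaussian(data)

# A point far from the bulk of the data receives a much lower log-density,
# which can be used to flag it as an outlier needing further scrutiny.
is_typical_more_likely = log_density(5.0, mean, var) > log_density(12.0, mean, var)
new_point = sample(mean, var, rng)
```

Real generative models are of course vastly richer, but the same two operations, scoring observations and sampling new ones, remain the core interface.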
Broadly speaking, probabilistic and generative thinking is widely used across the sciences. Although different in their aims and interpretations, fundamental concepts such as latent variables, random effects, factor models, and mixture models, which are pervasive across the social sciences, are in fact examples of generative models. A classic example is given by topic models, which allow the automatic extraction of meaningful topics from large corpora of text documents or, in other words, allow understanding and characterizing what documents are talking about. These and many other natural language processing techniques have enabled researchers and companies to treat "text as data" to be fed as input to downstream tasks, with major impact in many application areas, including research in Political Science and Economics. Another common example is probabilistic recommendation, where "products" and "customers" are assumed to possess latent unobserved features that determine the likelihood of a given customer giving a certain rating to a given product. Statistical learning is then used to infer the relevant features from observed data and thus build a concise yet informative representation of product and customer types.
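A bare-bones sketch of the latent-feature idea behind probabilistic recommendation, under simplifying assumptions not made in the text (squared-error loss, plain stochastic gradient descent, hypothetical function names), learns a small vector of unobserved features for each user and each item so that their inner product reproduces the observed ratings:

```python
import random

def factorize(ratings, n_users, n_items, k=2, steps=20000, lr=0.05, reg=0.01, seed=0):
    """Learn k latent features per user (U) and per item (V) by stochastic
    gradient descent on the squared error over observed (user, item, rating) triples."""
    rng = random.Random(seed)
    U = [[rng.gauss(0, 0.1) for _ in range(k)] for _ in range(n_users)]
    V = [[rng.gauss(0, 0.1) for _ in range(k)] for _ in range(n_items)]
    for _ in range(steps):
        u, i, r = ratings[rng.randrange(len(ratings))]
        err = r - sum(U[u][f] * V[i][f] for f in range(k))
        for f in range(k):
            # Move both feature vectors to reduce the error, with light regularization.
            U[u][f] += lr * (err * V[i][f] - reg * U[u][f])
            V[i][f] += lr * (err * U[u][f] - reg * V[i][f])
    return U, V

def predict(U, V, u, i):
    """Predicted rating of item i by user u: inner product of their latent features."""
    return sum(uf * vf for uf, vf in zip(U[u], V[i]))

# Two users with opposite tastes over two items.
ratings = [(0, 0, 5), (0, 1, 1), (1, 0, 1), (1, 1, 5)]
U, V = factorize(ratings, n_users=2, n_items=2)
```

Fully probabilistic treatments place priors on U and V and infer them via the posterior, but the learned low-dimensional representation of "customer types" and "product types" is the same underlying object.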
Many generative models, including the examples above, build a probabilistic representation for data x by specifying a joint probability model for x and z, p(x,z), where z are latent variables aimed at modelling fundamental but unobserved sources of variation. In the above examples, z would be topics and x the words chosen depending on the topic; or z would be customer and product features and x the observed ratings. Learning from such models, whether to produce new data or to make inferences about underlying structures, requires computing the marginal distribution of the data, p(x), or the so-called posterior distribution of the latent variables given the observed data, p(z | x). These tasks involve major computational challenges, especially in modern applications with thousands or millions of latent variables in the model. The challenges are usually tackled with one of two main classes of algorithms: variational ones, which build a deterministic and "easier-to-handle" approximation of p(z | x); and Monte Carlo ones, which build a stochastic representation of p(z | x) through appropriately drawn random samples. Providing a deeper understanding of the computational and statistical workings of such algorithms in the context of large-scale probabilistic models, as well as developing better and more efficient algorithms, is the focus of my recent ERC Starting Grant for the project "Provable Scalability for high-dimensional Bayesian Learning".
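The Monte Carlo idea can be illustrated on a model small enough that p(z | x) is known exactly: a Beta(1,1) prior on a success probability z with Bernoulli observations x. The sketch below (an illustrative choice, not from the grant itself) uses self-normalized importance sampling, drawing z from the prior and weighting each draw by its likelihood, to approximate the posterior mean of z.

```python
import random

def posterior_mean_mc(successes, n, n_samples=100_000, seed=0):
    """Self-normalized importance sampling estimate of E[z | x] in the
    Beta(1,1)-Bernoulli model: propose z from the Uniform(0,1) prior and
    weight each draw by its likelihood z^s * (1-z)^(n-s)."""
    rng = random.Random(seed)
    num = den = 0.0
    for _ in range(n_samples):
        z = rng.random()                                   # draw from the prior
        w = z ** successes * (1 - z) ** (n - successes)    # likelihood weight
        num += w * z
        den += w
    return num / den

# With s = 7 successes in n = 10 trials, the exact posterior is
# Beta(1 + 7, 1 + 3), whose mean is (1 + 7) / (2 + 10).
exact = (1 + 7) / (2 + 10)
approx = posterior_mean_mc(7, 10)
```

In this conjugate toy model the exact answer is available, so the stochastic approximation can be checked directly; the difficulty addressed by modern research is precisely that in high-dimensional models no such exact benchmark exists and naive weighting schemes degenerate.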
Looking forward, a deeper integration between probabilistic thinking and AI can contribute to tackling key challenges in these fields, ranging from uncertainty quantification to interpretability. A fascinating aspect of current research in probabilistic and generative modelling is that similar frameworks, and even algorithms, are nowadays increasingly used across very diverse scientific fields. This confers key importance and responsibility on methodological research in Statistics and ML, which can help recognize common structures and facilitate the flow of ideas across fields.