
Understanding the uncertainty in AI

by Botond Szabo, Associate Professor of Statistics
Modern machine learning methods use computational shortcuts to handle complex models and large data sets. This inevitably results in information loss. To use such approaches confidently, a deep mathematical understanding is necessary, and it is especially important to rigorously quantify the uncertainty of the procedure.

Machine and statistical learning are at the core of artificial intelligence, where the goal is to extract knowledge from data and learn from it. Modern applications require complex models, and the available real-world data are never perfectly clean or accurate: they often contain measurement and other errors, which makes the problem even more difficult. Statistics is the science of analysing and interpreting such noisy, imperfect data, and it plays a leading role in all modern data-centric developments.

In recent years, in particular, the amount of available information has increased substantially, and the models describing real-world phenomena have become increasingly complex. This introduces new challenges for data scientists: despite the ever-increasing power of computers, the computational cost of fitting such models has grown so large that it is impractical or even impossible to carry the computations out in a reasonable amount of time (or within the available memory). Novel statistical and machine learning methods were therefore developed to speed up the computations, using simplified models and computational shortcuts. However, these methods are often used as black-box procedures without rigorous mathematical understanding, which can produce misleading or plainly wrong answers without us even realizing it. A particular example is neural networks, the state-of-the-art approach for image classification, with applications ranging from medical imaging to self-driving cars. It has been shown that minor changes to the input images (too small to be detected by the human eye) or unusual positions of the objects can result in completely inaccurate classifications, leading to wrong diagnoses or incorrect detection of objects.
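To make this concrete, the toy sketch below shows the mechanism behind such adversarial perturbations, using the well-known fast gradient sign method. The tiny untrained classifier and the random "image" are stand-ins of my own for illustration, not part of any real pipeline; whether the predicted class actually flips depends on the model and on the size of the perturbation.

```python
import torch
import torch.nn as nn

# Sketch of an adversarial perturbation (fast gradient sign method).
# The classifier and "image" below are random stand-ins: the point is only
# to show how the gradient of the loss with respect to the input pixels
# tells us how to nudge each pixel, imperceptibly, towards an error.

torch.manual_seed(0)
model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))  # stand-in classifier
loss_fn = nn.CrossEntropyLoss()

image = torch.rand(1, 1, 28, 28, requires_grad=True)  # stand-in "image"
label = torch.tensor([3])                             # its assumed true class

loss = loss_fn(model(image), label)
loss.backward()  # gradient of the loss with respect to the input pixels

epsilon = 0.01  # perturbation far too small to be visible
adversarial = image + epsilon * image.grad.sign()

print("original prediction:   ", model(image).argmax(dim=1).item())
print("adversarial prediction:", model(adversarial).argmax(dim=1).item())
```

With a trained, high-capacity network such tiny, targeted nudges are routinely enough to change the predicted class, which is exactly the fragility described above.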

It is therefore highly important to study the theoretical properties of these modern learning methods and to derive both guarantees and limitations for them. One particularly important aspect is to understand how much we can rely on the derived results. In more formal terminology, it is essential to correctly assess the uncertainty of the procedure, which is based on noisy, real-world data and can therefore never be perfect. A principled way of obtaining uncertainty quantification is to use Bayesian methods. Bayesian statistics provides a natural, probabilistic way of incorporating expert knowledge into the model and automatically quantifies the remaining uncertainty of the procedure. It is becoming increasingly popular in machine learning and artificial intelligence; in natural language processing, for instance, Bayesian approaches (such as naïve Bayes classifiers) are often used in chatbots to find the most likely answer.
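A minimal illustration of this idea, using a standard textbook example with made-up numbers rather than anything from my own research: prior knowledge about an unknown success probability enters through a Beta prior, the data update it to a posterior, and a credible interval expresses the uncertainty that remains after seeing the data.

```python
from scipy import stats

# Bayesian uncertainty quantification in the simplest conjugate setting:
# estimating a success probability p from noisy binary data.
a_prior, b_prior = 2, 2       # assumed prior: mild belief that p is near 0.5
successes, failures = 37, 63  # hypothetical observed data

# Beta prior + binomial data gives a Beta posterior
posterior = stats.beta(a_prior + successes, b_prior + failures)

print("posterior mean:       ", round(posterior.mean(), 3))
print("95% credible interval:", tuple(round(q, 3) for q in posterior.interval(0.95)))
```

The credible interval is the automatic uncertainty statement mentioned above: it tells us not just a point estimate but how far the truth could plausibly be from it, given the prior and the data.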

My ERC Starting Grant focuses in particular on the theoretical understanding of statistical and machine learning methods, including the accuracy of parallel computing methods and the information loss incurred by considering simplified models instead of accurate, complex ones. Based on this theoretical understanding, I then aim to propose new approaches with higher accuracy. The main focus of my work is on mathematical statistics and its intersection with machine learning, information theory and numerical analysis. I am also occasionally involved in more applied projects, which build on the theoretical insights from my core research.
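To give a flavour of what "parallel computing methods" means here, the sketch below shows a divide-and-conquer computation in the simplest possible setting: a Gaussian mean with known variance, simulated data, and a flat prior. It is my own toy illustration of the general idea, not the grant's actual methodology; in this conjugate case merging the shard posteriors is exact, while for complex models some information is lost, and quantifying that loss is precisely the theoretical question.

```python
import numpy as np

# Divide-and-conquer Bayesian inference, toy version: split the data across
# "machines", let each machine compute a posterior from its own shard, and
# merge the shard posteriors by precision weighting.

rng = np.random.default_rng(0)
sigma, true_mu = 1.0, 0.7
data = rng.normal(true_mu, sigma, size=10_000)

shards = np.array_split(data, 10)  # pretend these live on 10 machines

# Each shard posterior is N(mean_j, sigma^2 / n_j) under a flat prior
precisions = np.array([len(s) / sigma**2 for s in shards])
means = np.array([s.mean() for s in shards])

merged_mean = np.sum(precisions * means) / np.sum(precisions)
merged_sd = np.sqrt(1.0 / np.sum(precisions))

print("merged posterior:    mean %.4f, sd %.4f" % (merged_mean, merged_sd))
print("full-data posterior: mean %.4f, sd %.4f" % (data.mean(), sigma / np.sqrt(len(data))))
```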

Working closely with scientists at the Psychology Institute of Leiden University, we have developed a learning method aimed at detecting Alzheimer's disease. In medical research, different types of data are often collected and combined to provide the best diagnosis. For the early diagnosis of Alzheimer's disease, for instance, structural and functional MRI data, questionnaire data, EEG data, genetic data, metabolomics data, etc. can be collected. These data sets differ substantially in both size and quality. To achieve the most accurate early diagnosis, one should find the most important features in these data sets and combine them in an optimal way. Furthermore, since these diagnostic tools can be expensive and of limited capacity, it is important to select the most relevant ones in order to obtain a reliable, accurate and cost-effective diagnostic method. We have developed a learning approach called Stacked Penalized Logistic Regression (StaPLR), which selects the most relevant diagnostic tools and their most relevant features for predicting the early onset of dementia. The method was successfully applied to clinical data containing patients with Alzheimer's disease and a control group.
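The sketch below conveys the stacking idea on simulated data with hypothetical "views" standing in for the different diagnostic tools. It is a simplified illustration built with scikit-learn, not the published StaPLR implementation, and the view names and tuning constants are made up: a penalized logistic regression is fitted per view, and a second, sparse logistic regression weights the views' out-of-fold predictions, so that uninformative views can be dropped entirely.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(1)
n = 300
y = rng.integers(0, 2, size=n)  # simulated diagnosis labels

# Three hypothetical "views" of different size and quality
views = {
    "mri":      y[:, None] * 0.8 + rng.normal(size=(n, 50)),   # informative
    "eeg":      y[:, None] * 0.3 + rng.normal(size=(n, 20)),   # weakly informative
    "genetics": rng.normal(size=(n, 200)),                     # pure noise
}

# Level 1: one penalized (L2) logistic regression per view,
# producing out-of-fold predicted probabilities as meta-features
meta_features = np.column_stack([
    cross_val_predict(LogisticRegression(penalty="l2", C=0.5, max_iter=1000),
                      X, y, cv=5, method="predict_proba")[:, 1]
    for X in views.values()
])

# Level 2: a sparse (L1) meta-learner decides which views to keep
meta = LogisticRegression(penalty="l1", solver="liblinear", C=1.0).fit(meta_features, y)
for name, weight in zip(views, meta.coef_[0]):
    print(f"view {name:8s} meta-weight {weight:+.2f}")
```

A view whose meta-weight is shrunk to zero contributes nothing to the final prediction, which is how the approach identifies the diagnostic tools that can be left out of an expensive screening protocol.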