Synthesizing Big, Wide and Dirty Data to Predict Election Outcomes
After the 2008 financial crisis, 45 new parties were created across Europe to capitalise on voters' discontent with austerity policies. These new parties – ranging from the extreme left to the extreme right – won a significant share of seats (18.3% of the total in 2016). In such a scenario, predicting election outcomes is even more important than usual, but it is also more difficult because of the absence of historical data on these emerging parties.
In most European countries, three different data sources can be informative about elections. The first comprises the opinion polls published by media outlets and institutions during the election campaign, up to a couple of days before the vote. These data are abundant, but they are subject to a number of biases and are only informative about average national sentiment, whereas – in most European countries – seats are allocated according to votes in local districts. A second source is activity on social media, which is harder to collect and potentially even more biased. These two sources are examples of "big" data, at least in comparison with the third, which consists of more carefully designed surveys with geographically and demographically stratified samples. These surveys are carried out by national institutes and record a number of voter characteristics in addition to voting intention; in this sense, they can be called "wide" data. However, among other issues, such surveys are conducted months before the election and hence do not capture the shifts that may occur close to election day. Taken together, the data for predicting election outcomes are therefore abundant but "dirty" and heterogeneous. None of these sources is fully informative in isolation, but their synthesis is. That synthesis, however, is far from trivial.
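To make the idea of synthesis concrete, here is a minimal sketch – not the authors' model, which is the full Bayesian framework of the 2019 paper listed under "Find out more" – of the simplest possible pooling step: a conjugate Gaussian update that combines district-level survey estimates with a fresh national poll average, weighting each by its precision. All numbers below are hypothetical.

```python
import numpy as np

# Hypothetical inputs: district-level survey estimates (the "wide" data)
# and a single fresh national poll average (the "big" data).
survey_share = np.array([0.22, 0.31, 0.18, 0.27])  # vote share per district
survey_se    = np.array([0.04, 0.05, 0.03, 0.04])  # survey standard errors

poll_share, poll_se = 0.25, 0.02  # national poll average and its error

# Conjugate Gaussian update: each district estimate is shrunk towards the
# national poll, with weights given by the respective precisions (1/variance).
prec_survey = 1.0 / survey_se**2
prec_poll   = 1.0 / poll_se**2
posterior_mean = (prec_survey * survey_share + prec_poll * poll_share) / (prec_survey + prec_poll)
posterior_se   = np.sqrt(1.0 / (prec_survey + prec_poll))

for d, (m, s) in enumerate(zip(posterior_mean, posterior_se)):
    print(f"district {d}: {m:.3f} ± {s:.3f}")
```

Taken at face value, the much smaller poll error would let the national number swamp the district surveys; a realistic synthesis therefore inflates the poll variance to account for polling biases, which is precisely the kind of modelling the project addresses.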
Omiros Papaspiliopoulos, a new Full Professor at Bocconi (Department of Decision Sciences), started working on this problem in 2015, together with José Garcia-Montalvo (UPF), a prominent applied economist, and Timothée Stumpf-Fétizon (Warwick), who had just graduated from the Master in Data Science at the Barcelona Graduate School of Economics, which Professor Papaspiliopoulos founded in 2013 and directed until 2020. Besides data-warehousing and modelling challenges, the project posed serious computational problems, which turned out to be of much broader relevance, applying more generally to high-dimensional sparse data and models. Professor Papaspiliopoulos, whose primary expertise is in statistical and computational methodology, recognised the common structures and – in joint work with Gareth Roberts (Warwick) and Giacomo Zanella (Assistant Professor at Bocconi) – developed new computational approaches that are provably scalable, that is, whose running time grows only linearly with the amount of data and the size of the model. This makes them practical for large-scale applications.
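To give a flavour of what "scalable" means here: the models in question include crossed random effects models, in which every observation is tagged by several factors (say, a district and a demographic group). The sketch below is a plain Gibbs sweep for a two-factor model, with variances fixed for simplicity and all sizes hypothetical; it only illustrates that one sweep costs time linear in the number of observations. The Biometrika paper cited below analyses a collapsed variant of such a sampler and shows when the number of sweeps needed does not blow up with the size of the data, which this sketch does not reproduce.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical crossed data: each observation pairs a "row" factor (e.g. a
# district) with a "column" factor (e.g. a demographic cell).
I, J, N = 50, 40, 2000
row = rng.integers(0, I, size=N)
col = rng.integers(0, J, size=N)
a_true = rng.normal(0, 0.5, I)
b_true = rng.normal(0, 0.5, J)
y = 1.0 + a_true[row] + b_true[col] + rng.normal(0, 1.0, N)

sigma2, tau_a2, tau_b2 = 1.0, 0.25, 0.25  # variances fixed for simplicity
mu = y.mean()
a, b = np.zeros(I), np.zeros(J)
n_row = np.bincount(row, minlength=I)
n_col = np.bincount(col, minlength=J)

for sweep in range(500):
    # Update all row effects: each full conditional is Gaussian, and the
    # per-level sufficient statistics are computed in O(N) via bincount.
    resid = y - mu - b[col]
    prec = n_row / sigma2 + 1.0 / tau_a2
    mean = np.bincount(row, weights=resid, minlength=I) / sigma2 / prec
    a = mean + rng.normal(size=I) / np.sqrt(prec)

    # Same update for the column effects.
    resid = y - mu - a[row]
    prec = n_col / sigma2 + 1.0 / tau_b2
    mean = np.bincount(col, weights=resid, minlength=J) / sigma2 / prec
    b = mean + rng.normal(size=J) / np.sqrt(prec)

    # Global mean, with a flat prior.
    resid = y - a[row] - b[col]
    mu = rng.normal(resid.mean(), np.sqrt(sigma2 / N))
```

Since every step above is a vectorised pass over the N observations, the cost per sweep is linear in the data; the hard part, addressed in the papers below, is proving that the chain also mixes fast enough for this linearity to translate into overall scalability.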
Such an interaction between applied problems and methodological innovation is far from unusual in Papaspiliopoulos's work. "I think that real-data projects like this are very stimulating for us as statisticians: they make us think about new problems and – hopefully – design new solutions. The same holds for consulting projects, which push us out of our academic comfort zone and can help us be more pragmatic. Of course, this does not mean being less rigorous. On the contrary, scientific rigour is our identity and our potential contribution, both to the private and the public sector. Especially when predictions are intended for policy makers, it is key that the methodology used is transparent, interpretable, and scientifically justified."
Find out more
Montalvo, J. G., Papaspiliopoulos, O., & Stumpf-Fétizon, T. (2019). "Bayesian Forecasting of Electoral Outcomes with New Parties' Competition." European Journal of Political Economy, 59, 52-70. https://doi.org/10.1016/j.ejpoleco.2019.01.006
Papaspiliopoulos, O., Roberts, G. O., & Zanella, G. (2020). "Scalable Inference for Crossed Random Effects Models." Biometrika, 107(1), 25-40. https://doi.org/10.1093/biomet/asz058
Papaspiliopoulos, O., Stumpf-Fétizon, T., & Zanella, G. (2021). "Scalable Computation for Bayesian Hierarchical Models." arXiv preprint, 1-48. https://arxiv.org/abs/2103.10875