IGIER Visiting Student Yanduo Chen on a seminar by Sendhil Mullainathan: machine learning techniques could advance scientific research by supporting better hypothesis generation

Every day we meet many people and look at their faces. Could you ever read their pulse by watching the blood vessels in their faces? You may try, but you won't succeed. Advanced machine learning algorithms can, and there are even mobile phone apps that measure your heart rate through the camera. Algorithms can capture patterns in data that humans cannot detect. Given this, can machine learning help scientists come up with new hypotheses based on new patterns in the data?

At the IGIER seminar of October 4th, Sendhil Mullainathan, the Roman Family University Professor of Computation and Behavioral Science at Chicago Booth, presented his new research on algorithmic behavioral science, joint with Jens Ludwig. He believes hypothesis generation could be done more systematically with the help of algorithms, and he illustrated the idea with a real case on judicial bias.

In the US today, approximately 12 million people are arrested every year. After arrest, defendants await trial either at home or in jail. The waiting period usually lasts 2-3 months (sometimes as long as 9-12 months), and a judge decides where the defendant waits. A biased decision can have very negative consequences: high-risk defendants may flee or commit new crimes if released, while low-risk defendants may sit in jail and lose their jobs. Machine learning can help detect whether judicial decisions are biased.

Before going any further, we should recall that machine learning is an excellent tool for predicting outcomes such as recidivism and judicial decisions. Mullainathan and Ludwig built two algorithms, one to predict the risk of recidivism and one to predict judges' decisions. The key finding is a misalignment between the two: judicial decisions are predicted by factors other than objective recidivism risk.

To understand what these "biasing" factors are, Mullainathan and Ludwig use a smart permutation exercise: they replace a given feature x of observation i with the value x' of the same feature for a randomly drawn observation j. Distorting the data in this way causes a drop in prediction performance, and the size of the drop quantifies the explanatory power of feature x. It turns out that, when one performs this exercise, the scans of a defendant's face explain the largest chunk of variation in judicial decisions. Two follow-up tests support this: face pixels alone predict judges' behavior significantly better than random guessing, and face pixels explain between 33% and 50% of the predictable variation.
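The permutation exercise can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the data are synthetic, `toy_model` is a hypothetical stand-in for a trained predictor, and shuffling a whole column is used as a simple way of giving each observation the value of a randomly drawn other observation. It is not the authors' implementation.

```python
import random

def accuracy(model, rows, labels):
    """Share of rows for which the model's prediction matches the label."""
    hits = sum(model(row) == y for row, y in zip(rows, labels))
    return hits / len(rows)

def permutation_importance(model, rows, labels, feature, seed=0):
    """Drop in accuracy after shuffling one feature across observations."""
    rng = random.Random(seed)
    shuffled = [row[feature] for row in rows]
    rng.shuffle(shuffled)
    # Each observation keeps its other features but receives another
    # observation's value for the chosen feature.
    permuted = [row[:feature] + [value] + row[feature + 1:]
                for row, value in zip(rows, shuffled)]
    return accuracy(model, rows, labels) - accuracy(model, permuted, labels)

# Synthetic data (illustrative only): feature 0 drives the label,
# feature 1 is pure noise.
data_rng = random.Random(42)
rows = [[data_rng.random(), data_rng.random()] for _ in range(2000)]
labels = [int(row[0] > 0.5) for row in rows]

def toy_model(row):
    """Hypothetical stand-in for a trained predictor."""
    return int(row[0] > 0.5)

drop_signal = permutation_importance(toy_model, rows, labels, feature=0)
drop_noise = permutation_importance(toy_model, rows, labels, feature=1)
print(f"accuracy drop when shuffling feature 0: {drop_signal:.3f}")
print(f"accuracy drop when shuffling feature 1: {drop_noise:.3f}")
```

In this toy setting, shuffling the informative feature costs the model a large share of its accuracy, while shuffling the noise feature costs nothing, which is the sense in which the drop measures a feature's explanatory power.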

Is this finding simply a rediscovery of known face-based biases? Thanks to excellent past research, we have substantial knowledge about what can be read from a face. Through carefully designed experiments, Mullainathan and Ludwig concluded that the algorithm is not rediscovering demographic stereotypes, skin tone bias, current charge bias or known face psychology. Is judging someone by their face, however, so intuitive that modelling it explicitly would be trivial? Through behavioral experiments, they rule out this explanation as well.

Face pixels are telling us something new, but what is it? Machine learning is a black box subject to interpretability challenges, but new approaches are emerging to dig deeper. Mullainathan and Ludwig applied a state-of-the-art method in which they compute gradients and project them onto faces, generating new faces with the help of generative adversarial networks (GANs). Through this process, they could not only visualize the different features picked up by the algorithm, but also isolate the categories of differences between faces, such as age and color. After running a judgement experiment with the generated faces, they uncovered a new factor, "well groomed": over 30% of the participants now identify "well groomed" as a factor they take into consideration.

At this stage, a new hypothesis has been successfully generated. The remaining steps are the same as in any other scientific research. They collect ratings of a set of images on "well-groomed" and regress judges' decisions on these ratings. The effect is strong even after controlling for other characteristics. After some additional tests addressing endogeneity concerns, the study is complete. Even more appealing, this captures only part of the algorithm's signal, so more hypotheses could be proposed by iterating the same process.
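The testing step, a regression of decisions on a rating with a control variable, can be sketched as follows. This is a minimal illustration under stated assumptions: the data are simulated, the variable names (`groomed`, `risk`, `released`) and the coefficients used to simulate them are hypothetical, and a linear probability model fitted by ordinary least squares stands in for whatever specification the authors actually use.

```python
import random

def ols(X, y):
    """Least squares via the normal equations (X'X) b = X'y,
    solved with Gauss-Jordan elimination."""
    k = len(X[0])
    # Augmented matrix [X'X | X'y].
    A = [[sum(row[i] * row[j] for row in X) for j in range(k)]
         + [sum(row[i] * yi for row, yi in zip(X, y))]
         for i in range(k)]
    for col in range(k):
        # Partial pivoting for numerical stability.
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]

# Simulated data (illustrative only). The "true" effect of the rating
# is set to 0.2 by assumption; it is not a result from the study.
rng = random.Random(7)
X, released = [], []
for _ in range(20000):
    groomed = rng.random()                       # hypothetical rating in [0, 1]
    risk = 0.5 * groomed + 0.5 * rng.random()    # control, correlated with rating
    p_release = 0.3 + 0.2 * groomed - 0.2 * risk
    X.append([1.0, groomed, risk])               # intercept, rating, control
    released.append(1.0 if rng.random() < p_release else 0.0)

intercept, beta_groomed, beta_risk = ols(X, released)
print(f"coefficient on rating, controlling for risk: {beta_groomed:.3f}")
```

Including the control in the design matrix is what "controlling for other characteristics" means here: the coefficient on the rating is estimated holding the control fixed, so it is not driven by the correlation between the two.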

The key takeaway of this seminar is that hypotheses can be proposed without relying on introspection about the data. The procedure that generates testable hypotheses could be adapted to other settings. Hypothesis generation could thus become as formal as hypothesis testing, and more research could be done without being limited by our cognitive apparatus.