In Part I, I discussed widely held concerns over the use of affective video technology, such as bias and privacy, to which HireVue has responded on its website. The one concern its website does not address, although both the media and AI Now (in its 2019 report) have scrutinized it, is the claim that the technology is based on outdated scientific knowledge. Meredith Whittaker, a co-founder of the AI Now Institute, called HireVue’s practices pseudoscience and a license to discriminate. HireVue’s CTO called the criticism uninformed, arguing that “most AI researchers have a limited understanding” of the psychology behind how workers think and behave. Nonetheless, even if AI researchers are not IO psychology experts, HireVue’s critics base their criticisms on sound scientific evidence that HireVue’s own so-called psychology experts are ignoring.
The science of emotion can, for the purposes of this critique, be divided into two camps: those following the “common view”, started by Paul Ekman in the 1960s, and those following more recent, exhaustive studies, with Lisa F. Barrett as their main advocate. The common view, on which affective video technology is based, presupposes 1) that certain emotion categories are routinely expressed by a unique facial configuration and 2) that people can reliably infer emotional states from a set of facial movements. Lisa F. Barrett and critics of affective AI technology tested both assumptions in a recent journal article and found evidence for neither.
HireVue uses reverse inference to assess something about a person’s emotional state that is not directly accessible: it claims to identify emotions based on their correlation with facial configurations. For this correlation to be valid, the emotion and the facial configuration need to occur together more often than chance would predict. To then make a reverse inference about a person’s emotional state through conditional probability, four criteria must be met: reliability, specificity, generalizability, and validity.
Reliability: whether a scowl can be said to correlate statistically with anger. For this we need to know whether anger and a scowl occur together often enough. Studies suggest that people sometimes scowl in anger, but not always above chance level. Self-reported anger may also occur without a scowl, making this inference hardly reliable.
Specificity: To infer emotions from facial configurations, these must be unique to a specific emotion. However, Barrett’s review, which looked at facial configurations, voice, physical symptoms and brain activity, found no facial configurations, or combinations, that are unique to any emotion.
Generalizability: This criterion involves at least two things: whether the scientific findings can be replicated in the real world, and whether they apply to non-Western and minority populations. The findings of studies on the generalizability of facial expressions are not robust.
Validity: This criterion remains a difficult and unanswered question, since it requires demonstrating objectively that, even if generalizability were strong, the person is truly experiencing the emotion. In other words, we need a way to verify that the person we perceive to be angry is truly angry.
For three of these four criteria (reliability, specificity, and generalizability) the evidence is limited, and we have yet to find a way to establish validity. Only when a pattern of facial muscle movements strongly satisfies these four criteria can we justify calling it an “emotional expression.” Scientists do agree that facial movements convey a range of information and are important for social communication, but Barrett et al.’s review suggests that the assumptions on which affective video technology is based are fragile and barely supported.
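To make the conditional-probability step concrete, here is a minimal sketch of reverse inference using Bayes' rule. Every number in it is hypothetical and chosen only for illustration; none comes from Barrett et al.'s review or from HireVue, and the function name is my own. It shows how weak reliability (anger often occurs without a scowl) and weak specificity (scowls often occur without anger) leave the inference from scowl to anger far from certain.

```python
def p_anger_given_scowl(p_scowl_given_anger, p_scowl_given_no_anger, p_anger):
    """Reverse inference via Bayes' rule: P(anger | scowl)."""
    # Total probability of observing a scowl, whether or not the person is angry.
    p_scowl = (p_scowl_given_anger * p_anger
               + p_scowl_given_no_anger * (1 - p_anger))
    return p_scowl_given_anger * p_anger / p_scowl


# Hypothetical, illustrative values (assumptions, not measured data):
reliability = 0.30     # P(scowl | anger): angry people scowl only sometimes
non_specific = 0.10    # P(scowl | no anger): scowls also occur without anger
base_rate = 0.05       # P(anger): how often anger occurs at all in this setting

posterior = p_anger_given_scowl(reliability, non_specific, base_rate)
print(f"P(anger | scowl) = {posterior:.2f}")
```

Under these assumed values, a scowl raises the probability of anger from 5% to only about 14%, so a system that reads every scowl as anger would be wrong most of the time. Different assumed values give different results; the point is only that the inference depends on all three quantities, not on the facial configuration alone.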
In other words, the available scientific evidence suggests that people sometimes smile when happy or frown when sad, as proposed by the common view, but how emotions are communicated can vary substantially across cultures, situations or people and “facial expressions” can express instances of different emotions. In fact, a given configuration of facial movements, such as a scowl, often communicates something other than an emotional state.
Whether or not we find Barrett et al.’s analysis of findings and their own research conclusive, they present enough evidence that none of the ideas behind current affective video technology stands up to scientific scrutiny. Affective video technology relies on commonplace understandings of emotions which are neither scientifically nor philosophically reliable. Because the views are still so commonplace, however, they go largely unexamined in the enthusiastic rush towards new tech, all the while yielding social harms: real human beings have their careers thwarted based on the misguided belief that we can simplistically measure human emotions in the face, body, or voice. No one would want to board a spaceship built on a pre-Galilean understanding of the universe, but this is precisely what over 700 companies have done by using HireVue technology.
Recommendations and Conclusion
Lisa F. Barrett et al. leave us with the following list of recommendations for further research:
New research on emotion should consider sampling individuals deeply, with high dimensional measurements, across many different situations, times of day, and so forth: a Big Data approach to learning the expressive repertoires of individual people. The diagnosis of an instance of emotion might be improved by combining many features, even those that are weakly diagnostic on their own, particularly if the analysis is conducted in a person-specific (idiographic) way.
Only a highly multivariate set of measures is likely to work to classify instances of emotion with high reliability and specificity. This means looking at more than physical clues. In principle, rich, multimodal observations could be available from videos; when time-synchronized with the other physical measurements, such video could be extremely useful in understanding the conditions when certain facial movements are made and what those movements might mean in a given context. Naturally, Big Data in the absence of hypotheses is not necessarily helpful. (And this raises issues of privacy.)
Participants could be offered the opportunity to annotate their videos with subjective ratings of the features that describe their experiences (whether or not they are identified as emotions). Candidate features are affective properties such as valence and arousal, appraisals (i.e., descriptions of how a situation is experienced), and emotion-related goals. These additional psychological features have the potential to add higher dimensional details to more specifically characterize facial movements and what they mean. Such an approach introduces various technical and modeling challenges, but this sort of deeply inductive approach is now within reach.
Although affective video technology is not necessarily a dead-end project, it has a long way to go before it is ethically viable, since it cannot even do what it purports to do. Understanding whether it is even possible, and if so how, to infer someone’s emotional state or predict someone’s future actions from their facial movements is not yet within reach. Furthermore, given that this technology affects the lives of individuals and society as a whole in real, tangible ways, it is unethical and irresponsible to claim to offer those services now. Its major impediment is the science behind it, but it also needs to answer the important concerns regarding privacy and how best to avoid recreating society’s biases and potentially creating new ones.