Can AI Read Emotional Expression?

Karen Fisher
Published in The Startup
Dec 5, 2020

A collage of different people, with varying expressions, classified by a convolutional neural network.

Robots are moving beyond the factory floor and into our offices, homes, and lives, becoming assistants. In time they will be our coworkers; further on, we may find them becoming our companions and even our friends. We already interact regularly with narrow AIs in various guises, such as the digital assistants Siri and Alexa. At present, the interaction is fairly narrow: we may, for example, ask Siri to suggest a restaurant while traveling in a new city, or for the weather forecast so we can dress appropriately for the day. Other AIs, such as Replika, may even become companions, engaging us in an ongoing conversation and becoming familiar with us. Further applications of interactive AI and robots have been proposed in domains such as medicine and front-line health care.

Most of these interactions are verbal. Human interaction, however, is not exclusively verbal. We also communicate with one another over a wider bandwidth that includes many non-verbal cues. Tone of voice, gestures, body language, and facial expressions are among the other channels through which humans interact. Our facial anatomy allows an amazing range of expressions, which complement our verbal interactions, often involuntarily. Imagine, for example, that during a conversation a friend's expression shifts toward sadness; we may pause to ask whether something we said has hurt or distressed her. The course of the conversation changes as a result.

Can computers, and by extension AI agents and sociable robots, interact with us in a similarly multi-dimensional fashion? Put another way, can an AI agent attain something akin to a “theory of mind,” recognizing the intentions and feelings of a conscious human agent? Can it express concern? One thread of this question is whether a computer can read emotional expressions, which might allow more complex forms of interaction with human beings: recognizing how its words affect the listener, for example. Narrowly, this begins with the basic components of computer vision: detection and classification.

We will approach two aspects of this narrower problem: a) the detection and classification of emotional states reflected in facial expressions; and b) their incorporation into a real-time system that can participate in an ongoing stream of interaction. The project implementation files, with the coding details, may be found in the GitHub repository listed in the references below.

Detection and classification

One of the more recent and successful approaches to facial expression recognition (FER) has been the use of convolutional neural networks. Though a full discussion is beyond the scope of this article, a convolutional network basically consists of several layers that abstract features from visual images (which ultimately consist of matrices of pixel values). Beginning with simple, low-level characteristics of images, such as vertical or horizontal edges, and building up to more complex features, this allows a neural network to recognize common patterns among a set of images (such as faces).

This in turn allows the network to essentially “learn” to map a set of inputs (images of faces) to outputs (labels for the expressions exhibited by those faces). When trained on thousands of labeled examples, the network can then be used to make inferences on new examples. As such, it is a form of what has come to be known as “supervised learning”: through the process of training, the network is shown a corpus of examples and gradually adjusts its internal parameters to produce outputs as close as possible to the actual or “ground truth” labels of those examples. Thus, for example, the model can learn to discern between the patterns of sad, happy, angry, and other facial expressions found in photographic images.

The model here is trained on the “FER2013” dataset. It was compiled originally by Pierre-Luc Carrier and Aaron Courville using the Google image search API to collect examples based on a variety of emotion-related keywords, such as ‘angry’ or ‘blissful’, and incorporating a range of ages, genders, ethnicities, and so forth. The collected photographs were then processed with the computer vision library OpenCV to isolate the facial regions, resulting in a series of close-cropped, 48 × 48 pixel grayscale images of human faces displaying a variety of expressions. From it, Mehdi Mirza and Ian Goodfellow subsequently prepared a subset of 35,887 images, classified into “Anger” (4953), “Disgust” (547), “Fear” (5121), “Happiness” (8989), “Sadness” (6077), “Surprise” (4002), and “Neutral” (6198) categories.[1]

Class distribution for the Kaggle FER2013 dataset
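As a quick check, the class distribution shown above can be reproduced with a few lines of pandas (the file and column names follow the Kaggle release described below; treat them as assumptions if your copy differs):

import pandas as pd

EMOTIONS = ['Angry', 'Disgust', 'Fear', 'Happy', 'Sad', 'Surprise', 'Neutral']

# Count how many examples fall into each of the seven emotion classes (0-6).
df = pd.read_csv('fer2013.csv')
counts = df['emotion'].value_counts().sort_index()
counts.index = EMOTIONS
print(counts)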

Through experimentation, Goodfellow estimated the human level of performance on the dataset (the “Bayes rate” baseline) to be about 65.5%. The dataset was then presented as a machine learning challenge on Kaggle. In the final competition results, the best accuracy achieved was approximately 71% on the test set using a convolutional neural network, several points above the Bayes rate.[1]

As published on Kaggle, the dataset has three columns: “usage,” “emotion,” and “pixels.” The first column originally marked which samples were used for training, public testing, and private testing in the context of the competition (we will instead simply make an 80/20% split between training and validation sets). The “pixels” column contains the image data in the form of a string of 2304 integer values from 0 to 255. These are converted into 1-d arrays and then reshaped as needed for input to the model. Normalization is accomplished by dividing the pixel values by 255 so that they range from 0 to 1. Finally, the “emotion” column contains the target classes as ordinal values 0–6.
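A minimal sketch of this preprocessing, under the assumptions just described, might look like the following:

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from tensorflow.keras.utils import to_categorical

df = pd.read_csv('fer2013.csv')

# Each "pixels" entry is a space-separated string of 2304 integer values (0-255).
X = np.array([np.array(s.split(), dtype='float32') for s in df['pixels']])
X = X.reshape(-1, 48, 48, 1) / 255.0                 # reshape for the model, normalize to [0, 1]
y = to_categorical(df['emotion'], num_classes=7)     # one-hot encode targets 0-6

# Simple 80/20 split between training and validation sets.
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)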

The model used here is a variation on the VGGNet architecture, constructed using the Keras library. It comprises 4 convolutional blocks, with the number of filters doubling in each block from 64 to 128, 256, and 512. Each block is followed by a pooling layer that downsamples the feature maps, and a batch normalization layer follows each convolutional layer. After the convolutional blocks, the data is fed through three fully connected hidden layers (1024, 256, and 64 neurons, respectively) and an output layer of seven neurons, one per class, with softmax activation to produce a probability distribution. The final classification is the class with the highest probability. Dropout layers provide some regularization (though the model still overfits).
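A sketch of this kind of architecture in Keras follows. Kernel sizes, padding, pooling parameters, and dropout rates are illustrative assumptions, so the shapes and parameter counts differ slightly from the model summary shown below; the exact configuration is in the project repository.

from tensorflow.keras import layers, models

def conv_block(x, filters):
    # Two 3x3 convolutions, each followed by batch normalization,
    # then downsampling and dropout.
    for _ in range(2):
        x = layers.Conv2D(filters, (3, 3), padding='same', activation='relu')(x)
        x = layers.BatchNormalization()(x)
    x = layers.MaxPooling2D(pool_size=(2, 2))(x)
    x = layers.Dropout(0.25)(x)
    return x

inputs = layers.Input(shape=(48, 48, 1))
x = inputs
for filters in (64, 128, 256, 512):          # four convolutional blocks
    x = conv_block(x, filters)

x = layers.Flatten()(x)
for units in (1024, 256, 64):                # three fully connected hidden layers
    x = layers.Dense(units, activation='relu')(x)
    x = layers.Dropout(0.5)(x)

outputs = layers.Dense(7, activation='softmax')(x)   # probability distribution over 7 classes
model = models.Model(inputs, outputs)
model.summary()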

Layer (type)                 Output Shape              Param #
=================================================================
conv2d_8 (Conv2D)            (None, 46, 46, 64)        640
batch_normalization_8 (Batch (None, 46, 46, 64)        256
conv2d_9 (Conv2D)            (None, 46, 46, 64)        36928
batch_normalization_9 (Batch (None, 46, 46, 64)        256
max_pooling2d_4 (MaxPooling2 (None, 22, 22, 64)        0
dropout_7 (Dropout)          (None, 22, 22, 64)        0
conv2d_10 (Conv2D)           (None, 22, 22, 128)       73856
batch_normalization_10 (Batc (None, 22, 22, 128)       512
conv2d_11 (Conv2D)           (None, 22, 22, 128)       147584
batch_normalization_11 (Batc (None, 22, 22, 128)       512
max_pooling2d_5 (MaxPooling2 (None, 10, 10, 128)       0
dropout_8 (Dropout)          (None, 10, 10, 128)       0
conv2d_12 (Conv2D)           (None, 10, 10, 256)       295168
batch_normalization_12 (Batc (None, 10, 10, 256)       1024
conv2d_13 (Conv2D)           (None, 10, 10, 256)       590080
batch_normalization_13 (Batc (None, 10, 10, 256)       1024
max_pooling2d_6 (MaxPooling2 (None, 4, 4, 256)         0
dropout_9 (Dropout)          (None, 4, 4, 256)         0
conv2d_14 (Conv2D)           (None, 4, 4, 512)         1180160
batch_normalization_14 (Batc (None, 4, 4, 512)         2048
conv2d_15 (Conv2D)           (None, 4, 4, 512)         2359808
batch_normalization_15 (Batc (None, 4, 4, 512)         2048
max_pooling2d_7 (MaxPooling2 (None, 2, 2, 512)         0
dropout_10 (Dropout)         (None, 2, 2, 512)         0
flatten_1 (Flatten)          (None, 2048)              0
dense_4 (Dense)              (None, 1024)              2098176
dropout_11 (Dropout)         (None, 1024)              0
dense_5 (Dense)              (None, 256)               262400
dropout_12 (Dropout)         (None, 256)               0
dense_6 (Dense)              (None, 64)                16448
dropout_13 (Dropout)         (None, 64)                0
dense_7 (Dense)              (None, 7)                 455
=================================================================
Total params: 7,069,383
Trainable params: 7,065,543
Non-trainable params: 3,840
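For completeness, a minimal training setup, continuing the preprocessing and model sketches above, might look like the following (the optimizer, batch size, and epoch count are illustrative assumptions rather than the settings used for the reported results):

# Compile and train the model on the prepared arrays.
model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['accuracy'])

history = model.fit(X_train, y_train,
                    validation_data=(X_val, y_val),
                    batch_size=64,
                    epochs=50)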

The validation accuracy following training has been about 66% on average, similar to the Bayes rate (human performance baseline) of 65.5%; both are well above the accuracy of a purely random guess (roughly 17%, if guessing in proportion to the class frequencies). However, the model performs better at detecting, for example, happiness (the majority class) than at less well-represented classes such as disgust (which amounts to only about 1.5% of the examples in the dataset as a whole).
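The confusion matrix and per-class metrics can be produced with scikit-learn, continuing the sketches above (the label order follows the FER2013 ordinal encoding):

import numpy as np
from sklearn.metrics import classification_report, confusion_matrix

labels = ['Angry', 'Disgust', 'Fear', 'Happy', 'Sad', 'Surprise', 'Neutral']

y_pred = np.argmax(model.predict(X_val), axis=1)   # predicted class per example
y_true = np.argmax(y_val, axis=1)                  # ground-truth class per example

print(confusion_matrix(y_true, y_pred))
print(classification_report(y_true, y_pred, target_names=labels))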

Confusion matrix for the FER2013 dataset on the validation set. Legend: 0-‘Angry’, 1-‘Disgust’, 2-‘Fear’, 3-‘Happy’, 4-‘Sad’, 5-‘Surprise’, 6-‘Neutral’

What’s in a label, anyway?

The labeling of the samples used to train a supervised learning model, the “ground truth” by which it is refined and finally evaluated, can also be a source of poor performance if it is inaccurate or inconsistent. Ultimately, a model may only be as good as the human tagging of the training and validation images used to develop it. In the previously referenced paper, the authors note that “FER-2013 could theoretically suffer from label errors due to the way it was collected…”[1] A cursory examination indeed reveals a number of inaccurately, inconsistently, or ambiguously labeled examples. Indeed, human expressions are likely to be classified differently by different observers, and emotions may themselves be inherently ambiguous or complex.

Sad, or neutral? An ambiguously labeled example in the FER2013 dataset.

Similarly, Emad Barsoum, Cha Zhang, Cristian Canton Ferrer, and Zhengyou Zhang, of Microsoft Research, found the original FER2013 dataset to be labeled inaccurately.[2] They proposed, as an alternative, a crowdsourcing-based method to tag the examples more accurately. Having determined that ten taggers per image formed an effective quorum, they recorded votes on how each image should be classified, resulting in a probability distribution over the possible labels for each example. This reflects the fact that different individuals may “read” the same facial expression differently. They then explored different methods of using the resulting dataset, designated FER2013New or FER+, as ground truth to train a deep learning model. The simplest method is a form of majority rule: selecting the classification receiving the most votes as the ground truth for that example. This is the approach we have taken.

From the research paper[2]: differences in labeling between the original FER2013 and the FER+ (FER2013New) datasets

The FER+ dataset has 8 rather than 7 basic emotion classifications, adding a “contempt” label, as well as 2 additional categories, “unknown” and “NF.” The last denotes rows that do not have associated images. The original dataset includes the images as strings of pixel values (0–255), which we convert into three-dimensional tensors (including a dimension for the number of color channels, 1 since they are grayscale) and normalize to values from 0 to 1; the FER+ authors converted them into image files, but these are not published in their GitHub repository. The CSV files they did publish are row-for-row parallel with the original FER2013 dataset as published on Kaggle, so we have merged them into one data frame and updated the target column to the majority vote for each example. Taking “contempt” and “disgust” to be overlapping if not essentially synonymous, we add the votes for “contempt” to the “disgust” class, thereby maintaining the original dimensionality of the model’s output. (There were also very few examples categorized under “contempt” anyway.)
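A sketch of this merging and relabeling step follows. The file and vote-column names are assumptions based on the FER+ repository (fer2013new.csv); check your copy if they differ.

import pandas as pd

fer = pd.read_csv('fer2013.csv')          # original pixel strings (row-for-row parallel)
ferplus = pd.read_csv('fer2013new.csv')   # FER+ crowd-sourced vote counts

vote_cols = ['neutral', 'happiness', 'surprise', 'sadness',
             'anger', 'disgust', 'fear', 'contempt', 'unknown', 'NF']
votes = ferplus[vote_cols].copy()

# Fold the few "contempt" votes into "disgust", keeping seven emotion classes.
votes['disgust'] += votes['contempt']
votes = votes.drop(columns=['contempt'])

# The majority vote becomes the new target. Columns 0-6 are now the seven
# emotions; columns 7 and 8 are "unknown" and "NF", whose rows are dropped.
winner = votes.values.argmax(axis=1)
keep = winner < 7
fer = fer.loc[keep].reset_index(drop=True)
fer['emotion'] = winner[keep]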

Revised class distribution for the FER2013New dataset (validation set)

Using the revised dataset has a perhaps surprising impact on the results. With an identical architecture, the revised labels yield a significant improvement, from 66% to 81% accuracy on the validation set: a gain of 15 percentage points. This suggests that the revised tagging has made the labeling more consistent, even if labeling remains inherently subjective.

Confusion matrix for the FER2013New dataset. Legend: 0:‘Neutral’, 1:‘Happy’, 2:‘Surprise’, 3:‘Sad’, 4: ‘Angry’, 5:‘Disgust’, 6:‘Fear’

Real-time implementation

Ultimately, the value of a machine learning model lies in its ability to be deployed in some practical manner to make inferences on new, previously unseen examples. In this application, that means detecting expressions live, in real time.

To implement a simple application as a proof of concept, we use OpenCV both to capture frames from a live video stream and to perform face detection, extracting the facial regions. Face detection is accomplished with the Viola-Jones method (the Haar cascade classifier built into OpenCV). The resulting region of interest is then resized to 48 × 48 pixels, converted to grayscale, and preprocessed to be fed through the trained convolutional network for inference. On a Windows 10 HP ProBook (2.5 GHz 64-bit CPU, 4 GB RAM), running TensorFlow 2.0.0 and OpenCV 4.4.0, the live detection runs at approximately 9 frames per second.
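A minimal sketch of such a loop follows. The saved model file name is hypothetical, and the label order here follows the FER+ encoding described above; it must match whatever encoding the model was actually trained with.

import cv2
import numpy as np
from tensorflow.keras.models import load_model

LABELS = ['Neutral', 'Happy', 'Surprise', 'Sad', 'Angry', 'Disgust', 'Fear']

model = load_model('fer_model.h5')    # hypothetical path to the trained model
cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + 'haarcascade_frontalface_default.xml')

cap = cv2.VideoCapture(0)             # default webcam
while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    for (x, y, w, h) in cascade.detectMultiScale(gray, scaleFactor=1.3, minNeighbors=5):
        # Crop the face, resize to the network's 48x48 input, and normalize.
        face = cv2.resize(gray[y:y + h, x:x + w], (48, 48)) / 255.0
        probs = model.predict(face.reshape(1, 48, 48, 1))[0]
        label = LABELS[int(np.argmax(probs))]
        cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)
        cv2.putText(frame, label, (x, y - 10),
                    cv2.FONT_HERSHEY_SIMPLEX, 0.9, (0, 255, 0), 2)
    cv2.imshow('FER', frame)
    if cv2.waitKey(1) & 0xFF == ord('q'):   # press 'q' to quit
        break

cap.release()
cv2.destroyAllWindows()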

A less deep network (with three convolutional layers) yields lower validation accuracy (approximately 73%) but runs faster, at approximately 13 FPS. There may be a trade-off to be made between accuracy and throughput, especially as it can be argued that no single expression has one uniquely true classification anyway. This can in turn become one strand in an overall interactive application, integrating other aspects of visual recognition, natural language processing (chatbot, sentiment analysis), and so forth. For such purposes, performance comparable to the human level is sufficient in practice; in fact, one may argue that a machine too perfect at this task would seem less than “friendly” to its interlocutors. In any case, reading an expression is likely the beginning rather than the conclusion of an interaction: it prompts further inquiry.

Where from here?

The model presented here can be improved in several ways. A glaring drawback is the imbalance of the dataset (in either form, FER2013 or the revised FER+ version). Predictive models often show bias toward the majority classes when the class distribution is imbalanced in both the training and validation datasets. This can be approached using data augmentation or, better yet, by accumulating additional examples in the under-represented categories, such as “disgust.” Also, the current model’s overfitting may reduce its ability to generalize in actual use. However, it is also arguable that there is not necessarily one correct classification of any given facial expression; hence the appeal of the majority vote, or of considering a probability distribution rather than a single answer for each case, as the authors of the FER+ dataset do.
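As one illustration, simple data augmentation with Keras’ ImageDataGenerator can help with overfitting, and augmented copies of the minority classes can be used to rebalance the training set. The parameter values below are illustrative assumptions, continuing the earlier sketches:

from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Random rotations, shifts, zooms, and flips applied on the fly during training.
augmenter = ImageDataGenerator(rotation_range=10,
                               width_shift_range=0.1,
                               height_shift_range=0.1,
                               zoom_range=0.1,
                               horizontal_flip=True)

# Train on augmented batches (older TensorFlow versions may require fit_generator).
model.fit(augmenter.flow(X_train, y_train, batch_size=64),
          validation_data=(X_val, y_val),
          epochs=50)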

But, in the long run, is just becoming more precise in the simple, one-dimensional classification of expressions sufficient to promote AI/human interaction? Is simply labeling a moment’s expression enough? We suspect not.

Human emotions are complex, often subtle and ambiguous. A deep learning model may sort them into seven broad classifications, but there are many shades within each. In a sense, these basic emotions may be akin to primary colors, with many admixtures. Nor do they occur in a vacuum; they arise holistically, within a living stream of both verbal and non-verbal communication. For this reason, it would be of interest to widen the scope of the model so that it might more effectively enable an AI agent to interact with human agents in varying contexts.

The ultimate goal may well be for an AI, a machine, to attain something akin to a “theory of mind.” Humans, through interacting with others (other human beings, pets or other animals, and apparently even machines), tend to infer a presence of mind or intention within them. Can a machine infer something similar about us? Is there an AI equivalent of empathy, even? Can an AI, a robot, meaningfully interact with us?

In that light, we may conjecture at least a couple of broad directions to explore:

  1. Instead of focusing on supervised classification with predefined “ground truth” classes for the training examples, approach the problem with unsupervised clustering methods (k-means, nearest neighbors, or autoencoders). We could then see how diverse expressions form their own familial relationships, rather than being coerced into distinct linguistic bins. The resulting clusters could still serve as a coarse classification while capturing the many subtler nuances within each (a minimal sketch follows this list).
  2. Expand into a wider bandwidth of human expression. Facial expressions are just one modality or channel of expression. Others include body language (gesture and posture) as well as verbal communication. Expressions are also time-based sequences, not just isolated moments. One approach could be to analyze video clips, perhaps from movies or shows, as training data: a synthesis of facial expression and pose detection, combined with NLP analysis of the accompanying verbal communication (tone of voice, sentiment of the spoken language) and of the context within which the interaction occurs.
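As a conjectural sketch of the first direction, one could cluster the face images without their ground-truth labels, for example by reducing the flattened pixels with PCA (an autoencoder bottleneck would serve a similar role) and running k-means; the number of components and clusters below are arbitrary choices for illustration:

from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# X is the (N, 48, 48, 1) image array from the preprocessing sketch above.
flat = X.reshape(len(X), -1)

# Reduce dimensionality, then group the images into clusters of similar expressions.
embeddings = PCA(n_components=50).fit_transform(flat)
clusters = KMeans(n_clusters=10, random_state=42).fit_predict(embeddings)

# The clusters can then be inspected for familial resemblances among expressions,
# rather than forcing each image into a predefined linguistic bin.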

References

Project files may be found at https://github.com/karencfisher/face_express

[1] Challenges in Representation Learning: A report on three machine learning contests. I. Goodfellow, D. Erhan, P. L. Carrier, A. Courville, M. Mirza, B. Hamner, W. Cukierski, Y. Tang, D. H. Lee, Y. Zhou, C. Ramaiah, F. Feng, R. Li, X. Wang, D. Athanasakis, J. Shawe-Taylor, M. Milakov, J. Park, R. Ionescu, M. Popescu, C. Grozea, J. Bergstra, J. Xie, L. Romaszko, B. Xu, Z. Chuang, and Y. Bengio. https://arxiv.org/abs/1307.0414
[2] Training Deep Networks for Facial Expression Recognition with Crowd-Sourced Label Distribution. Emad Barsoum, Cha Zhang, Cristian Canton Ferrer, Zhengyou Zhang. https://arxiv.org/abs/1608.01041

Facial Expression Recognition Challenge using Convolutional Neural Network https://colab.research.google.com/drive/1XiJ-sa5Kg324mpq_XG_JMWOlfj_DvZFv
Facial Expression Recognition using Convolutional Neural Networks: State of the Art. Christopher Pramerdorfer, Martin Kampel. https://arxiv.org/abs/1612.02903
Facial Emotion Detection Using Convolutional Neural Networks and Representational Autoencoder Units. Prudhvi Raj Dachapally. https://arxiv.org/abs/1706.01509
Emotion Recognition via Facial Expression: Utilization of Numerous Feature Descriptors in Different Machine Learning Algorithms. John Chris T. Kwong, Felan Carlo C. Garcia, Patricia Angela R. Abu, and Rosula S.J. Reyes. https://www.researchgate.net/publication/329290966_Emotion_Recognition_via_Facial_Expression_Utilization_of_Numerous_Feature_Descriptors_in_Different_Machine_Learning_Algorithms
Facial emotion recognition using convolutional neural networks (FERC). Ninad Mehendale. https://link.springer.com/content/pdf/10.1007/s42452-020-2234-1.pdf

Karen Fisher
The Startup

Software developer, deep learning engineer, and artist with an interest in AI and robotics. They'd like to have robot friends one day.