How Scientists are Teaching Computers to Read Emotions

Esha Desai '28 & Kate Choi '28

Revolutionizing Human-Technology Interactions: Computers that “Read” Emotions
Emotions convey human desires and reactions, which often relate to everyday occurrences. They are the intangible words spoken between people to exchange information. According to the American Psychological Association, an emotion is a complex reaction pattern involving experiential, behavioral, and physiological elements (American Psychological Association, 2021). Beyond this definition, emotions serve three main functions. First, they play a major role in adapting to new environments by regulating attention, memory, and perception. Second, communities use emotions as a tool for connection, with 60% to 80% of communication taking place nonverbally (Schlicher et al., 2025). Languages rely on emotion to convey precise ideas and nuanced connotations, so the same word can carry different meanings. Lastly, emotions are used in decision-making, usually in the “fast” mode of thinking known as “System 1,” an unconscious mode that relies on personal experience and emotion and is useful for making day-to-day, reversible decisions (Pei et al., 2024). These critical roles have drawn neuroscientists, psychologists, engineers, and social scientists to study the unspoken language of emotions. In fact, scientists are now teaching computers to “read” emotions.

The researchers’ first question was: How can machines become emotionally intelligent? The answer lay in affective computing. Rosalind Picard introduced the term in her 1997 book, “Affective Computing,” which proposed the idea of computers with emotional “awareness.” Affective computing, also known as artificial emotional intelligence, is a field that develops AI able to recognize, interpret, and respond to human emotions. Its goal is to create more natural and empathetic human-computer interactions (HCIs) through the recognition of facial expressions, voice, gestures, and physiological signals such as heart rate (Ezzameli & Mahersia, 2023). With the growth of the technology industry, there has been an increasing focus on training computers to identify emotions, which can improve HCIs and support well-being monitoring. Additionally, computers with the ability to “read” emotions may improve adaptive systems in education, healthcare, and robotics (Koshkina, 2025).

One vital feature of affective computing is computer vision, especially facial expression recognition. A Convolutional Neural Network (CNN) is a machine learning model that can process image and video inputs. Its convolutional layers slide small filters, called kernels, across each input image and learn which visual patterns matter. To train the model, datasets containing labeled facial expressions for basic emotions can be used, such as FER2013, AffectNet, and CK+. Many CNN models achieve more than 90% accuracy for classifying basic emotions in controlled settings (El Bahri et al., 2025). Models trained only on visual datasets, however, often require long training cycles and have poor accuracy in complex environments. Long training cycles are expected because datasets can contain more than 330,000 images and videos (Ultralytics). Accuracy also tends to drop when CNN models are applied to uncontrolled environments such as classrooms, workplaces, and other learning spaces, because the training images cover only a limited range of conditions. To counter these challenges, researchers are developing more robust models by expanding training datasets and testing the models in different environments.
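To make the idea of a kernel "filtering the image" concrete, here is a minimal sketch in plain Python. It is not taken from any of the cited studies: the 4x4 patch, the hand-picked edge kernel, and the function names `convolve2d` and `relu` are illustrative assumptions. A real CNN learns its kernel values during training and stacks many such layers; like most deep learning libraries, this sketch slides the kernel without flipping it.

```python
from typing import List

Matrix = List[List[float]]


def convolve2d(image: Matrix, kernel: Matrix) -> Matrix:
    """Slide `kernel` over `image` (stride 1, no padding) and return the feature map."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    feature_map = []
    for r in range(out_h):
        row = []
        for c in range(out_w):
            # Multiply the kernel with the image window under it and sum the products.
            total = 0.0
            for i in range(kh):
                for j in range(kw):
                    total += image[r + i][c + j] * kernel[i][j]
            row.append(total)
        feature_map.append(row)
    return feature_map


def relu(feature_map: Matrix) -> Matrix:
    """Keep positive responses and zero out the rest, as a typical CNN activation does."""
    return [[max(0.0, v) for v in row] for row in feature_map]


# Hypothetical 4x4 grayscale patch: dark on the left, bright on the right.
patch = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]

# Hand-picked kernel that responds to dark-to-bright vertical edges.
# In a trained CNN these numbers would be learned, not chosen by hand.
edge_kernel = [
    [-1, 1],
    [-1, 1],
]

feature_map = relu(convolve2d(patch, edge_kernel))
for row in feature_map:
    print(row)
```

The feature map is largest in the middle column, where the kernel sits on top of the edge. A facial expression model relies on many learned kernels of this kind to pick up features such as the curve of a mouth or a raised eyebrow.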

Another important feature of affective computing is speech emotion recognition, where the computer extracts emotional cues from the voice’s pitch, intensity, and spectral patterns. Promising methods for this task combine CNNs with Recurrent Neural Networks (RNNs) such as Long Short-Term Memory (LSTM) networks, which are designed to process sequential data (Begazo et al., 2024). Choosing the right dataset is crucial because there are three types of audio databases for emotion recognition: simulated, induced, and natural (Schlicher et al., 2025). While simulated and induced datasets approximate real-life emotions, the emotions in them are artificially prompted rather than spontaneously experienced, which may lead to high error rates in real-world applications. Natural datasets have their own downside: spontaneous emotions are fluid, which makes it difficult to label a recording with one specific emotion.
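The cues named above (pitch, intensity, and spectral patterns) can each be reduced to numbers. The sketch below is a simplified illustration in plain Python, not the pipeline of any cited study: it uses a synthetic 200 Hz tone in place of a real voice recording, treats the strongest frequency as a crude stand-in for pitch, and uses hypothetical function names. Real systems work on short overlapping frames of recorded speech and use faster, more robust algorithms.

```python
import cmath
import math
from typing import List


def rms_intensity(signal: List[float]) -> float:
    """Root-mean-square amplitude, a simple measure of loudness."""
    return math.sqrt(sum(x * x for x in signal) / len(signal))


def magnitude_spectrum(signal: List[float]) -> List[float]:
    """Magnitudes of a naive discrete Fourier transform (bins 0 to n/2)."""
    n = len(signal)
    magnitudes = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(signal))
        magnitudes.append(abs(acc))
    return magnitudes


def dominant_frequency(magnitudes: List[float], sample_rate: int, n: int) -> float:
    """Frequency of the strongest bin, used here as a rough proxy for pitch."""
    k = max(range(len(magnitudes)), key=magnitudes.__getitem__)
    return k * sample_rate / n


def spectral_centroid(magnitudes: List[float], sample_rate: int, n: int) -> float:
    """Magnitude-weighted average frequency, one common spectral feature."""
    total = sum(magnitudes)
    return sum(k * sample_rate / n * m for k, m in enumerate(magnitudes)) / total


# Synthetic stand-in for a voice frame: a 200 Hz tone, 0.05 s long (assumed values).
SAMPLE_RATE = 8000
N = 400
tone = [0.5 * math.sin(2 * math.pi * 200 * t / SAMPLE_RATE) for t in range(N)]

spectrum = magnitude_spectrum(tone)
intensity = rms_intensity(tone)
pitch_estimate = dominant_frequency(spectrum, SAMPLE_RATE, N)
centroid = spectral_centroid(spectrum, SAMPLE_RATE, N)
print(intensity, pitch_estimate, centroid)
```

Feature values like these, computed frame by frame, form the sequences that a recurrent model such as an LSTM can read in order.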

In a study by researchers from the ESTIA Institute of Technology in Valencia, Spain, researchers tested whether integrating two CNN networks, CNN-1D and CNN-2D, would produce more accurate emotion detection in a controlled environment using two types of input: spectrogram images and spectral features. Spectrogram images are visual representations of sound and other signals; CNN models can be trained on spectrogram-derived representations such as MFCCs to classify emotions from voice. Spectral features, on the other hand, are distinct patterns extracted from the data that show frequency and energy distributions, helping neural networks analyze and classify emotion in speech. In the study, Model 1 used 162 spectral features taken from voice signals, attaining 93.29% accuracy in training and 89.10% in validation (Schlicher et al., 2025). While the gap between training and validation accuracy is small (about four percentage points), longer training and validation phases may widen it. Specifically, Model 1 had difficulty identifying the emotion “fear” in the validation phase, with only 79% accuracy, which contributed to the gap between the two phases. In contrast, Model 2 used only spectrogram images as its input. Yet, similar to Model 1, Model 2’s accuracy scores for validation and training differed slightly, at 88.62% and 92.55% respectively. Surprisingly, the recognition of “fear” had a high success rate of 93%; the model, however, struggled to identify “sadness,” with 85.16% accuracy (Schlicher et al., 2025). Consequently, the researchers predicted that combining both models, each with its own strengths, might produce more accurate results. This prediction was confirmed when Model 3 showed close accuracy between the validation and training phases, at 97.48% and 96.51% respectively. By using both spectrograms and spectral features, the new model reached a Weighted Accuracy (WA) score of 97%, the highest of all three models.

Furthermore, even when the validation and training phases were scaled up, the WA score remained at 96%. The emotion “sadness” had a 96.23% success rate, while the emotion “disgust” had a 95.99% success rate, which can be attributed to the use of different types of emotion representations (Schlicher et al., 2025). Diversifying the input representations was key to Model 3’s success; hence, incorporating multimodal representations proves critical to the future of affective computing. The results of this study have many practical implications in fields such as mental health, education, and advertising.
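The article does not spell out how the WA score or the per-emotion success rates were computed. The sketch below assumes the convention common in speech emotion recognition: WA is the share of all samples classified correctly, and a per-emotion success rate is that emotion's recall. The confusion-matrix counts are invented for illustration and are not the study's data; the function names are likewise hypothetical.

```python
from typing import Dict, List

LABELS = ["fear", "sadness", "disgust"]

# Invented counts for illustration only. Rows are the true emotion,
# columns are the emotion the model predicted, in the order of LABELS.
confusion = [
    [90, 5, 5],      # true "fear"
    [10, 180, 10],   # true "sadness"
    [5, 5, 40],      # true "disgust"
]


def per_class_recall(cm: List[List[int]], labels: List[str]) -> Dict[str, float]:
    """Share of each true emotion that the model labeled correctly."""
    return {label: cm[i][i] / sum(cm[i]) for i, label in enumerate(labels)}


def weighted_accuracy(cm: List[List[int]]) -> float:
    """Overall accuracy: correct predictions divided by all samples.

    Larger classes carry more weight in this score, hence the name.
    """
    correct = sum(cm[i][i] for i in range(len(cm)))
    total = sum(sum(row) for row in cm)
    return correct / total


def unweighted_accuracy(cm: List[List[int]]) -> float:
    """Plain average of the per-class recalls, so every emotion counts equally."""
    recalls = [cm[i][i] / sum(cm[i]) for i in range(len(cm))]
    return sum(recalls) / len(recalls)


recalls = per_class_recall(confusion, LABELS)
wa = weighted_accuracy(confusion)
ua = unweighted_accuracy(confusion)
print(recalls)
print(round(wa, 4), round(ua, 4))
```

With these invented counts, WA comes out higher than the unweighted average because the largest class is also one of the best recognized. This is why a single accuracy figure can hide a weak emotion, as happened with “fear” in Model 1.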

Education, whether face-to-face or online, uses technology to communicate information. While current technology is helpful, teachers can benefit further from understanding students’ behavior in order to teach them more effectively. Researchers from the Private University of Fez explored whether specific emotions in students correlated with higher academic achievement. In their study, they used Learner Emotions, a web application that analyzes students’ facial expressions through a front-facing camera. This software uses machine learning to identify emotions with an accuracy of 0.91 (Llurba & Palau, 2024). Although observing individual students for multiple hours a day is not practical for a human, the researchers were able to conduct this experiment by incorporating technology that can detect emotions, highlighting affective computing as an essential research tool in education.

Despite the benefits it brings, affective computing still has its limits. It often deals with personal characteristics like faces, voices, feelings, and emotions. Users, therefore, are often hesitant to provide personal information, especially as public awareness of internet footprints and data privacy increases. Sharing personal information with computer models raises privacy concerns, making the development of privacy-preserving algorithms essential before affective computing can enter the mainstream. Transparency is one way to approach this issue. Given the complexity of these neural models, it is important to understand how the different models are merged in order to judge their effectiveness, which builds transparency and therefore credibility. Real-time processing presents another concern. Trimodal or multimodal processing is computationally intensive and can be time-consuming, so efficient processing and scalability are needed to make this technology practical for real-life applications. Speed, however, must be balanced against the system’s effectiveness and accuracy, which makes improving real-time processing difficult.

In conclusion, affective computing enables machines to detect and respond to human emotions through cues like facial expressions, speech, and physiological signals, enhancing the interactions between humans and computers. Advances in methods such as CNN-based facial analysis, speech emotion recognition, and multimodal systems have demonstrated that computers can achieve high accuracy in controlled environments. Despite these appealing features, however, real-world application of the technology remains a challenge, with concerns about users’ data privacy and the computer’s ability to process inputs in real time. In response, researchers are improving dataset diversity, transparency, and privacy-preserving algorithms, paving the way for affective computing to revolutionize future human-technology relations.

References

Alsaadawi, H. F. T., Das, B., & Das, R. (2024). TAC-Trimodal Affective Computing: Principles, integration process, affective detection, challenges, and solutions. Displays, 83, 102731. https://doi.org/10.1016/j.displa.2024.102731
Computers have emotions too: New research shows AI can teach technology to recognise emotions with 98% accuracy. (2023, May 26). Www.brunel.ac.uk. https://www.brunel.ac.uk/news-and-events/news/articles/Computers-have-emotions-too-New-research-shows-AI-can-teach-technology-to-recognise-emotions-with-98-accuracy
El Bahri, N., Itahriouan, Z., & Ouazzani Jamil, M. (2025). Emotion-Aware Education Through Affective Computing and Learning Analytics: Insights from a Moroccan University Case Study. Digital, 5(3), 45. https://doi.org/10.3390/digital5030045
Koshkina, D. (2025, November 6). Data Visualization & Affective Computing. Design That Manipulates Emotions or Design That Helps Reflect on Emotions? Nightingale. https://nightingaledvs.com/data-visualization-affective-computing/
Llurba, C., & Palau, R. (2024). Real-Time Emotion Recognition for Improving the Teaching–Learning Process: A Scoping Review. Journal of Imaging, 10(12), 313. https://doi.org/10.3390/jimaging10120313
Pei, G., Li, H., Lu, Y., Wang, Y., Hua, S., & Li, T. (2024). Affective Computing: Recent Advances, Challenges, and Future Trends. Intelligent Computing, 3. https://doi.org/10.34133/icomputing.0076
Raj, R., & Demirkol, I. (2025). An improved facial emotion recognition system using convolutional neural network for the optimization of human robot interaction. Scientific Reports, 15(1). https://doi.org/10.1038/s41598-025-22835-0
Schlicher, M., Li, Y., Murthy, K., Sun, Q., & Schuller, B. W. (2025). Emotionally adaptive support: a narrative review of affective computing for mental health. Frontiers in Digital Health, 7, 1657031. https://doi.org/10.3389/fdgth.2025.1657031