Author

Xiangyi Li

Publication Date

Spring 2025

Degree Type

Master's Project

Degree Name

Master of Science in Computer Science (MSCS)

Department

Computer Science

First Advisor

Faranak Abri

Second Advisor

Fabio Di Troia

Third Advisor

Shuyi Wang

Keywords

Emotion Detection, Dimensional Emotion Modeling, Natural Language Processing, Multimodal Analysis, Transformer Models

Abstract

Emotion detection plays a crucial role in human-computer interaction, enabling machines to recognize and respond appropriately to human emotional states. This project explores a two-stage approach to emotion detection using multimodal data, first predicting dimensional values (Arousal, Valence, Dominance) from textual and audio inputs, then mapping these representations to discrete emotion categories. We compare this approach with direct categorical classification using transformer-based language models such as BERT, RoBERTa, and DeBERTa for text processing, alongside various audio feature extraction methods including MFCCs and spectrograms. Using the IEMOCAP dataset, we evaluate both approaches across text-only, audio-only, and multimodal configurations. Our findings reveal that while the two-stage approach provides richer emotional representations, direct classification achieves higher accuracy (91.82% with RoBERTa) than the two-stage method (90.13% with the same model). Interestingly, text-only approaches slightly outperform multimodal ones, though the gap narrows with optimal fusion strategies. For dimensional prediction, we observe that textual features better capture valence (positive/negative sentiment), while audio features more effectively represent arousal (emotional intensity). This research contributes valuable insights into the tradeoffs between dimensional and categorical approaches to emotion recognition, with implications for applications requiring either maximum classification accuracy or nuanced emotional understanding. The findings suggest that application requirements should dictate the choice between these approaches, with direct classification preferred for accuracy-critical tasks and the two-stage approach for scenarios benefiting from continuous emotional representation.
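
As a minimal sketch of the two-stage pipeline the abstract describes, the Python fragment below regresses Valence, Arousal, and Dominance (VAD) from text and then maps the predicted vector to the nearest discrete category. The roberta-base encoder, the sigmoid regression head, the nearest-prototype mapping rule, and the hand-picked prototype coordinates are all illustrative assumptions, not the project's actual architecture or values.

import numpy as np
import torch
from transformers import AutoModel, AutoTokenizer

# Hypothetical VAD anchor points (scaled to [0, 1]) for a few IEMOCAP
# categories; real anchors would be estimated from annotated data.
PROTOTYPES = {
    "happy":   np.array([0.9, 0.7, 0.6]),
    "angry":   np.array([0.1, 0.9, 0.8]),
    "sad":     np.array([0.2, 0.3, 0.3]),
    "neutral": np.array([0.5, 0.4, 0.5]),
}

class VADRegressor(torch.nn.Module):
    """Stage 1: a text encoder with a 3-unit regression head (V, A, D)."""
    def __init__(self, encoder_name="roberta-base"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(encoder_name)
        self.head = torch.nn.Linear(self.encoder.config.hidden_size, 3)

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        first_token = out.last_hidden_state[:, 0]      # sequence-start embedding
        return torch.sigmoid(self.head(first_token))   # VAD values in (0, 1)

def vad_to_category(vad):
    """Stage 2: nearest-prototype lookup in VAD space."""
    return min(PROTOTYPES, key=lambda k: np.linalg.norm(vad - PROTOTYPES[k]))

# Usage (weights are untrained here, so the output is only illustrative):
tokenizer = AutoTokenizer.from_pretrained("roberta-base")
batch = tokenizer(["I can't believe this happened!"], return_tensors="pt")
model = VADRegressor()
with torch.no_grad():
    vad = model(batch["input_ids"], batch["attention_mask"])[0].numpy()
print(vad_to_category(vad))

A learned classifier over the predicted VAD vector would be an equally plausible stage-2 choice; the nearest-prototype rule appears here only because it makes the dimensional-to-categorical mapping explicit.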

Available for download on Monday, May 25, 2026
