Publication Date
Spring 2025
Degree Type
Master's Project
Degree Name
Master of Science in Computer Science (MSCS)
Department
Computer Science
First Advisor
Faranak Abri
Second Advisor
Fabio Di Troia
Third Advisor
Shuyi Wang
Keywords
Emotion Detection, Dimensional Emotion Modeling, Natural Language Processing, Multimodal Analysis, Transformer Models
Abstract
Emotion detection plays a crucial role in human-computer interaction, enabling machines to recognize and respond appropriately to human emotional states. This project explores a two-stage approach to emotion detection from multimodal data: first predicting dimensional values (Arousal, Valence, Dominance) from textual and audio inputs, then mapping these representations to discrete emotion categories. We compare this approach with direct categorical classification, using transformer-based language models such as BERT, RoBERTa, and DeBERTa for text processing, alongside audio feature extraction methods including MFCCs and spectrograms. Using the IEMOCAP dataset, we evaluate both approaches across text-only, audio-only, and multimodal configurations. Our findings reveal that while the two-stage approach provides richer emotional representations, direct classification achieves higher accuracy (91.82% with RoBERTa) than the two-stage method (90.13% with the same model). Interestingly, text-only approaches slightly outperform multimodal ones, though the gap narrows with optimal fusion strategies. For dimensional prediction, we observe that textual features better capture valence (positive/negative sentiment), while audio features more effectively represent arousal (emotional intensity). This research offers insight into the trade-offs between dimensional and categorical approaches to emotion recognition, with implications for applications requiring either maximum classification accuracy or nuanced emotional understanding. The findings suggest that application requirements should dictate the choice between these approaches: direct classification for accuracy-critical tasks, and the two-stage approach for scenarios that benefit from continuous emotional representations.
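
For concreteness, the sketch below outlines the two-stage pipeline the abstract describes: a RoBERTa encoder with a small regression head predicts continuous VAD values from text (stage one), and a nearest-centroid rule maps the predicted VAD vector to a discrete emotion label (stage two). This is a minimal illustration assuming PyTorch and Hugging Face Transformers; the label set, centroid coordinates, and model configuration are hypothetical placeholders, not the project's actual implementation.

# Minimal sketch of the two-stage approach described in the abstract.
# Stage 1: regress continuous VAD (valence, arousal, dominance) values
# from text with a RoBERTa encoder. Stage 2: map the predicted VAD
# vector to a discrete emotion category. Centroid values and the label
# set below are illustrative placeholders, not the project's numbers.

import torch
import torch.nn as nn
from transformers import RobertaModel, RobertaTokenizer

class VADRegressor(nn.Module):
    """Stage 1: text -> (valence, arousal, dominance) in (0, 1)^3."""
    def __init__(self):
        super().__init__()
        self.encoder = RobertaModel.from_pretrained("roberta-base")
        self.head = nn.Linear(self.encoder.config.hidden_size, 3)

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        first_tok = out.last_hidden_state[:, 0]   # embedding of the first (<s>) token
        return torch.sigmoid(self.head(first_tok))  # bound VAD predictions to (0, 1)

# Stage 2: nearest-centroid mapping from VAD space to categories.
# Hypothetical centroids for four IEMOCAP-style labels, as (V, A, D).
CENTROIDS = {
    "happy":   torch.tensor([0.85, 0.60, 0.60]),
    "sad":     torch.tensor([0.20, 0.30, 0.30]),
    "angry":   torch.tensor([0.15, 0.80, 0.70]),
    "neutral": torch.tensor([0.50, 0.40, 0.50]),
}

def vad_to_category(vad: torch.Tensor) -> str:
    """Assign the emotion whose centroid is closest in Euclidean distance."""
    return min(CENTROIDS, key=lambda k: torch.dist(vad, CENTROIDS[k]).item())

if __name__ == "__main__":
    tok = RobertaTokenizer.from_pretrained("roberta-base")
    model = VADRegressor().eval()  # untrained weights: illustration only
    batch = tok(["I can't believe we won!"], return_tensors="pt")
    with torch.no_grad():
        vad = model(batch["input_ids"], batch["attention_mask"])[0]
    print(vad.tolist(), "->", vad_to_category(vad))

Stage two could equally be a learned classifier over the VAD vector; the nearest-centroid rule shown here is just the simplest possible mapping from the dimensional space to discrete categories.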
Recommended Citation
Li, Xiangyi, "Two-Stage Emotion Detection from Multimodal Data" (2025). Master's Projects. 1540.
DOI: https://doi.org/10.31979/etd.mwc6-hguw
https://scholarworks.sjsu.edu/etd_projects/1540