Multimodal Emotion Detection and Analysis from Conversational Data

Abhinay Jatoth, San Jose State University
Faranak Abri, San Jose State University
Tien Nguyen, San Jose State University

Abstract

Emotion recognition in conversations has become increasingly relevant due to its potential applications across fields such as customer service, social media, and mental health. In this work, we explore multimodal emotion detection using both textual and audio data. Our models leverage deep learning architectures, including Transformer-based models such as Bidirectional Encoder Representations from Transformers (BERT), the Robustly optimized BERT approach (RoBERTa), the Audio Spectrogram Transformer (AST), and Wav2Vec2, as well as Bidirectional Long Short-Term Memory (BiLSTM) networks, together with four fusion strategies that combine features from multiple modalities. We evaluate our approaches on two widely used emotion datasets, IEMOCAP and EMOV. Experimental results show that fusion models consistently outperform single-modality models, with Late Fusion achieving the highest weighted F1-score of approximately 78% on IEMOCAP using both audio and text.
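To make the late-fusion idea concrete, the following is a minimal sketch, not the paper's actual implementation: it assumes each modality is classified independently (e.g., a fine-tuned RoBERTa text classifier and a Wav2Vec2 audio classifier) and that their class probabilities are combined by a weighted average. The label set, weights, and probability values are hypothetical.

```python
import numpy as np

# Hypothetical emotion label set; the actual classes depend on the dataset.
EMOTIONS = ["angry", "happy", "neutral", "sad"]

def late_fusion(text_probs: np.ndarray,
                audio_probs: np.ndarray,
                text_weight: float = 0.5) -> str:
    """Fuse per-modality class probabilities and return the predicted label.

    text_probs / audio_probs: softmax outputs of shape (num_classes,),
    e.g. from independently trained text and audio classifiers.
    """
    fused = text_weight * text_probs + (1.0 - text_weight) * audio_probs
    return EMOTIONS[int(np.argmax(fused))]

# Example with made-up probabilities: text leans "sad", audio leans "neutral".
text_probs = np.array([0.05, 0.10, 0.25, 0.60])
audio_probs = np.array([0.10, 0.10, 0.45, 0.35])
print(late_fusion(text_probs, audio_probs))  # -> "sad"
```

Because fusion happens at the decision level, each modality's model can be trained and tuned independently, which is one reason late fusion is a common baseline for combining heterogeneous encoders.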