Publication Date
8-18-2025
Document Type
Article
Publication Title
Journal of Advances in Information Technology
Volume
16
Issue
8
DOI
10.12720/jait.16.8.1127-1141
First Page
1127
Last Page
1141
Abstract
Emotion recognition in conversations has become increasingly relevant due to its potential applications across fields such as customer service, social media, and mental health. In this work, we explore multimodal emotion detection using both textual and audio data. Our models leverage deep learning architectures, including Transformer-based models such as Bidirectional Encoder Representations from Transformers (BERT), Robustly optimized BERT approach (RoBERTa), Audio Spectrogram Transformer (AST), and Wav2Vec2, as well as Bidirectional Long Short-Term Memory (BiLSTM) networks, together with four fusion strategies that combine features from multiple modalities. We evaluate our approaches on two widely used emotion datasets, IEMOCAP and EMOV. Experimental results show that fusion models consistently outperform single-modality models, with Late Fusion achieving the highest weighted F1-score of approximately 78% on IEMOCAP using both audio and text.
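To illustrate the late-fusion strategy the abstract highlights, below is a minimal PyTorch sketch. It assumes each modality encoder (e.g., RoBERTa for text, Wav2Vec2 for audio) has already produced a fixed-size utterance embedding; the head architecture, the 768-dimensional embeddings, the four emotion classes, and the equal 0.5 fusion weight are illustrative assumptions, not the authors' implementation.

    import torch
    import torch.nn as nn

    # Illustrative sizes: 768-d embeddings (typical of RoBERTa/Wav2Vec2 base
    # models) and 4 emotion classes, a common IEMOCAP setup. All names and
    # dimensions here are assumptions for the sketch.
    NUM_CLASSES = 4

    class ModalityHead(nn.Module):
        """Per-modality classifier over a precomputed utterance embedding."""
        def __init__(self, in_dim: int, num_classes: int = NUM_CLASSES):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(in_dim, 256),
                nn.ReLU(),
                nn.Linear(256, num_classes),
            )

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            return self.net(x)  # unnormalized class logits

    def late_fusion(text_logits: torch.Tensor,
                    audio_logits: torch.Tensor,
                    w_text: float = 0.5) -> torch.Tensor:
        """Late fusion: combine per-modality class probabilities,
        rather than intermediate features."""
        p_text = text_logits.softmax(dim=-1)
        p_audio = audio_logits.softmax(dim=-1)
        return w_text * p_text + (1.0 - w_text) * p_audio

    if __name__ == "__main__":
        text_head = ModalityHead(in_dim=768)
        audio_head = ModalityHead(in_dim=768)
        text_emb = torch.randn(2, 768)   # stand-ins for encoder outputs
        audio_emb = torch.randn(2, 768)
        fused = late_fusion(text_head(text_emb), audio_head(audio_emb))
        print(fused.argmax(dim=-1))      # predicted emotion per utterance

The design point is that late fusion keeps a separate classifier per modality and merges their decisions, whereas feature-level (early) fusion would concatenate embeddings before a single head; the paper compares four such strategies.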
Funding Number
2319803
Funding Sponsor
National Science Foundation
Keywords
conversational data, emotion recognition, fusion models, multimodal learning, Wav2Vec2, Bidirectional Encoder Representations from Transformers (BERT)
Creative Commons License
This work is licensed under a Creative Commons Attribution 4.0 License.
Department
Computer Science
Recommended Citation
Abhinay Jatoth, Faranak Abri, and Tien Nguyen. "Multimodal Emotion Detection and Analysis from Conversational Data." Journal of Advances in Information Technology 16, no. 8 (2025): 1127-1141. https://doi.org/10.12720/jait.16.8.1127-1141