Publication Date

8-18-2025

Document Type

Article

Publication Title

Journal of Advances in Information Technology

Volume

16

Issue

8

DOI

10.12720/jait.16.8.1127-1141

First Page

1127

Last Page

1141

Abstract

Emotion recognition in conversations has become increasingly relevant due to its potential applications across fields such as customer service, social media, and mental health. In this work, we explore multimodal emotion detection using both textual and audio data. Our models leverage deep learning architectures, including Transformer-based models such as Bidirectional Encoder Representations from Transformers (BERT), Robustly optimized BERT approach (RoBERTa), the Audio Spectrogram Transformer (AST), and Wav2Vec2; Bidirectional Long Short-Term Memory (BiLSTM) networks; and four fusion strategies that combine features from multiple modalities. We evaluate our approaches on two widely used emotion datasets, IEMOCAP and EMOV. Experimental results show that fusion models consistently outperform single-modality models, with Late Fusion achieving the highest weighted F1-score of approximately 78% on IEMOCAP using both audio and text.
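For illustration, the sketch below shows one common form of late (decision-level) fusion: each modality is classified independently and the resulting class probabilities are averaged. The module names, feature dimensions, and averaging rule here are assumptions for this example, not the authors' implementation.

```python
# Minimal late-fusion sketch (decision-level fusion). Each modality gets
# its own classifier head, and per-modality class probabilities are
# averaged. Dimensions and names are illustrative assumptions.
import torch
import torch.nn as nn

NUM_CLASSES = 4  # e.g., a four-class IEMOCAP setup (assumed here)

class LateFusion(nn.Module):
    def __init__(self, text_dim=768, audio_dim=768, num_classes=NUM_CLASSES):
        super().__init__()
        # One classifier head per modality, applied to pooled features
        # produced upstream (e.g., by BERT and Wav2Vec2 encoders).
        self.text_head = nn.Linear(text_dim, num_classes)
        self.audio_head = nn.Linear(audio_dim, num_classes)

    def forward(self, text_feat, audio_feat):
        # Classify each modality separately, then average the probabilities.
        p_text = torch.softmax(self.text_head(text_feat), dim=-1)
        p_audio = torch.softmax(self.audio_head(audio_feat), dim=-1)
        return (p_text + p_audio) / 2

model = LateFusion()
text_feat = torch.randn(8, 768)   # e.g., BERT [CLS] embeddings (assumed)
audio_feat = torch.randn(8, 768)  # e.g., mean-pooled Wav2Vec2 features (assumed)
probs = model(text_feat, audio_feat)
print(probs.shape)  # torch.Size([8, 4])
```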

Funding Number

2319803

Funding Sponsor

National Science Foundation

Keywords

conversational data, emotion recognition, fusion models, multimodal learning, Wav2Vec2, Bidirectional Encoder Representations from Transformers (BERT)

Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 License.

Department

Computer Science
