Publication Date

1-1-2024

Document Type

Article

Publication Title

IEEE Access

Volume

12

DOI

10.1109/ACCESS.2024.3463400

First Page

136451

Last Page

136465

Abstract

In machine learning, class imbalance is a pressing issue: models become biased toward the majority classes and underperform on the minority classes. In textual data, this bias significantly reduces the overall accuracy of natural language processing (NLP) models, along with poor performance on minority classes. This paper investigates and compares the performance of transformer-based models: a Multi-head Attention model combined with data-level and algorithm-level approaches, and BERT (Bidirectional Encoder Representations from Transformers) combined with LLM-based data augmentation. The research applied approaches such as Random Over Sampler, the Synthetic Minority Over-sampling Technique (SMOTE), SMOTEENN, word-level data augmentation, class weights, and L2 regularization, and leveraged GPT-3.5-Turbo for data augmentation to create additional samples for the imbalanced dataset. The experimental results demonstrate that LLM-based data augmentation with Multi-head Attention and BERT on the Myers-Briggs Type Indicator (MBTI) dataset (a highly skewed dataset) achieves the highest precision, recall, and F1 score, all at 0.76. This indicates that LLM-based data augmentation yields significant improvements in dealing with class imbalance and improves the model's accuracy on minority class types in the MBTI dataset.
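The data-level and algorithm-level approaches named in the abstract are standard library techniques; the following is a minimal sketch of them, assuming scikit-learn and imbalanced-learn (this is not the authors' code, and the toy corpus, labels, and variable names are illustrative placeholders):

    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.utils.class_weight import compute_class_weight
    from imblearn.over_sampling import RandomOverSampler, SMOTE
    from imblearn.combine import SMOTEENN

    # Tiny imbalanced toy corpus standing in for the MBTI posts.
    texts = [
        "i enjoy quiet evenings with a good book",
        "planning ahead keeps my week organized",
        "spreadsheets and schedules are my comfort zone",
        "i love spontaneous road trips with friends",
        "parties energize me more than anything",
        "crowds and loud music are where i thrive",
        "meeting new people every day is the best",
        "i talk through my ideas out loud",
    ]
    labels = np.array(["INTJ", "INTJ", "INTJ",
                       "ENFP", "ENFP", "ENFP", "ENFP", "ENFP"])

    # SMOTE interpolates in feature space, so the text is vectorized first.
    X = TfidfVectorizer().fit_transform(texts)

    # Data-level approaches: duplicate, synthesize, or synthesize-then-clean.
    ros = RandomOverSampler(random_state=0)
    X_ros, y_ros = ros.fit_resample(X, labels)

    smote = SMOTE(k_neighbors=2, random_state=0)  # small k for the toy data
    X_sm, y_sm = smote.fit_resample(X, labels)

    senn = SMOTEENN(smote=smote, random_state=0)
    X_se, y_se = senn.fit_resample(X, labels)

    # Algorithm-level approach: per-class weights that rescale each class's
    # contribution to the training loss.
    weights = compute_class_weight("balanced",
                                   classes=np.unique(labels), y=labels)
    print(dict(zip(np.unique(labels), weights)))

The LLM-based augmentation step can likewise be approximated with the openai Python client; the prompt below is a hypothetical illustration of paraphrasing minority-class posts, not the authors' actual prompt:

    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    def augment(post: str, mbti_type: str, n: int = 3) -> list[str]:
        """Ask GPT-3.5-Turbo for n paraphrases of a minority-class post."""
        response = client.chat.completions.create(
            model="gpt-3.5-turbo",
            n=n,
            messages=[
                {"role": "system",
                 "content": "You paraphrase social-media posts while "
                            "preserving the writer's personality cues."},
                {"role": "user",
                 "content": f"Paraphrase this post by an {mbti_type} "
                            f"writer:\n{post}"},
            ],
        )
        return [choice.message.content for choice in response.choices]

Under this reading of the abstract, the generated paraphrases would be appended to the minority classes before training the Multi-head Attention model or fine-tuning BERT.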

Funding Number

2319802

Funding Sponsor

National Science Foundation

Keywords

BERT, GPT-3.5-Turbo, imbalanced dataset, LLM, multi-head attention, Myers-Briggs Type Indicator

Creative Commons License

This work is licensed under a Creative Commons Attribution-Noncommercial-No Derivative Works 4.0 License.

Department

Computer Science
