Publication Date
1-1-2024
Document Type
Article
Publication Title
IEEE Access
Volume
12
DOI
10.1109/ACCESS.2024.3463400
First Page
136451
Last Page
136465
Abstract
In machine learning, class imbalance is a pressing issue: models become biased toward the majority classes and underperform on the minority classes. In textual data, this bias in natural language processing (NLP) models significantly reduces overall accuracy, along with poor performance on minority classes. This paper investigates and compares the performance of transformer-based models: Multi-head Attention combined with data-level and algorithmic-level approaches, and BERT (Bidirectional Encoder Representations from Transformers) with LLM-based data augmentation. The research utilized approaches such as Random Over Sampler, Synthetic Minority Over-sampling Technique (SMOTE), SMOTEENN, word-level data augmentation, class weights, L2 regularization, and GPT-3.5-Turbo-based data augmentation to create additional samples in an imbalanced dataset. The experimental results demonstrate that LLM-based data augmentation with Multi-head Attention and BERT on the Myers-Briggs Type Indicator (MBTI) dataset (a highly skewed dataset) achieves the highest precision, recall, and F1 score, each at 0.76. This indicates that LLM-based data augmentation yields significant improvements in dealing with class imbalance and improves the model's accuracy on minority class types in the MBTI dataset.
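To make the data-level baselines concrete: the simplest of the approaches named above, the Random Over Sampler, duplicates minority-class samples at random until every class matches the majority-class count. The sketch below is not the authors' implementation; it is a minimal pure-Python illustration of that idea, with made-up example texts and MBTI-style labels.

```python
import random
from collections import Counter, defaultdict

def random_oversample(texts, labels, seed=0):
    """Randomly duplicate minority-class samples until every class
    reaches the majority-class count (a plain Random Over Sampler)."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for text, label in zip(texts, labels):
        by_label[label].append(text)
    target = max(len(items) for items in by_label.values())
    out_texts, out_labels = [], []
    for label, items in by_label.items():
        # Keep all original samples, then pad with random duplicates.
        out_texts.extend(items)
        out_labels.extend([label] * len(items))
        extra = target - len(items)
        out_texts.extend(rng.choices(items, k=extra))
        out_labels.extend([label] * extra)
    return out_texts, out_labels

# Toy skewed dataset: 4 INTJ posts vs. 1 INFP and 1 ENTJ post.
texts = ["post a", "post b", "post c", "post d", "post e", "post f"]
labels = ["INTJ", "INTJ", "INTJ", "INTJ", "INFP", "ENTJ"]
bal_texts, bal_labels = random_oversample(texts, labels)
print(Counter(bal_labels))  # every class now has 4 samples
```

SMOTE and SMOTEENN replace the raw duplication step with synthetic interpolation in feature space, while the paper's LLM-based approach instead prompts GPT-3.5-Turbo to generate new minority-class texts.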
Funding Number
2319802
Funding Sponsor
National Science Foundation
Keywords
BERT, GPT 3.5-turbo, imbalance dataset, LLM, Multi-head attention, Myers-Briggs type indicators
Creative Commons License
This work is licensed under a Creative Commons Attribution-Noncommercial-No Derivative Works 4.0 License.
Department
Computer Science
Recommended Citation
Saroj Gopali, Faranak Abri, Akbar Siami Namin, and Keith S. Jones. "The Applicability of LLMs in Generating Textual Samples for Analysis of Imbalanced Datasets." IEEE Access 12 (2024): 136451-136465. https://doi.org/10.1109/ACCESS.2024.3463400