Publication Date
1-1-2024
Document Type
Article
Publication Title
IEEE Access
Volume
12
DOI
10.1109/ACCESS.2024.3463400
First Page
136451
Last Page
136465
Abstract
In machine learning, class imbalance is a pressing issue: models become biased toward the majority classes and underperform on the minority classes. In textual data, this bias in natural language processing (NLP) models significantly reduces overall accuracy, along with poor performance on minority classes. This paper investigates and compares the performance of transformer-based models: Multi-head Attention combined with data-level and algorithmic-level approaches, and BERT (Bidirectional Encoder Representations from Transformers) with LLM-based data augmentation. The research utilized approaches such as Random Over Sampler, Synthetic Minority Over-sampling Technique (SMOTE), SMOTEENN, word-level data augmentation, class weights, L2 regularization, and GPT-3.5-Turbo-based data augmentation to create additional samples in an imbalanced dataset. The experimental results demonstrate that LLM-based data augmentation with Multi-head Attention and BERT on the Myers-Briggs Type Indicator (MBTI) dataset (a highly skewed dataset) achieves the highest precision, recall, and F1 score, each at 0.76. This indicates that LLM-based data augmentation yields significant improvements in dealing with class imbalance and improves the model's accuracy on minority class types in the MBTI dataset.
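To make the data-level baselines concrete: the simplest of the approaches named above, the Random Over Sampler, duplicates minority-class samples at random until every class matches the majority-class count. The sketch below is not the authors' implementation; it is a minimal pure-Python illustration of that idea, with made-up example texts and MBTI-style labels.

```python
import random
from collections import Counter, defaultdict

def random_oversample(texts, labels, seed=0):
    """Randomly duplicate minority-class samples until every class
    reaches the majority-class count (a plain Random Over Sampler)."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for text, label in zip(texts, labels):
        by_label[label].append(text)
    target = max(len(items) for items in by_label.values())
    out_texts, out_labels = [], []
    for label, items in by_label.items():
        # Keep all original samples, then pad with random duplicates.
        out_texts.extend(items)
        out_labels.extend([label] * len(items))
        extra = target - len(items)
        out_texts.extend(rng.choices(items, k=extra))
        out_labels.extend([label] * extra)
    return out_texts, out_labels

# Toy skewed dataset: 4 INTJ posts vs. 1 INFP and 1 ENTJ post.
texts = ["post a", "post b", "post c", "post d", "post e", "post f"]
labels = ["INTJ", "INTJ", "INTJ", "INTJ", "INFP", "ENTJ"]
bal_texts, bal_labels = random_oversample(texts, labels)
print(Counter(bal_labels))  # every class now has 4 samples
```

SMOTE and SMOTEENN replace the raw duplication step with synthetic interpolation in feature space, while the paper's LLM-based approach instead prompts GPT-3.5-Turbo to generate new minority-class texts.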
Funding Number
2319802
Funding Sponsor
National Science Foundation
Keywords
BERT, GPT 3.5-turbo, imbalance dataset, LLM, Multi-head attention, Myers-Briggs type indicators
Creative Commons License
This work is licensed under a Creative Commons Attribution-Noncommercial-No Derivative Works 4.0 License.
Department
Computer Science
Recommended Citation
Saroj Gopali, Faranak Abri, Akbar Siami Namin, and Keith S. Jones. "The Applicability of LLMs in Generating Textual Samples for Analysis of Imbalanced Datasets." IEEE Access 12 (2024): 136451-136465. https://doi.org/10.1109/ACCESS.2024.3463400