Publication Date
Spring 2023
Degree Type
Master's Project
Degree Name
Master of Science (MS)
Department
Computer Science
First Advisor
Ching-Seh Wu
Second Advisor
Genya Ishigaki
Third Advisor
Fabio Di Troia
Keywords
Natural Language Processing, Text-to-Speech, Normalization, Machine Learning, Transformer, Language Models, Pre-trained BERT, Human-Computer Interaction
Abstract
Text-to-Speech (TTS) normalization is an essential component of natural language processing (NLP) that plays a crucial role in producing natural-sounding synthesized speech. However, the TTS normalization procedure has limitations: lengthy input sequences and variations in spoken language can present difficulties. The motivation behind this research is to address these challenges by evaluating and comparing the performance of various models and determining their effectiveness in handling language variations. The models include LSTM-GRU, Transformer, GCN-Transformer, GCNN-Transformer, Reformer, and a pre-trained BERT language model. The research evaluates the performance of these models using a variety of metrics, including accuracy, loss, word error rate, and sentence error rate. Google's TTS Wikipedia dataset was used for the primary experiments. To evaluate the efficacy of TTS normalization models on inconsistent language, such as slang, this research also produced a relatively small Twitter dataset, which was manually annotated to serve as an additional evaluation benchmark. The inclusion of this dataset offers further insight into the models' effectiveness in handling variations in language. The results of this study demonstrate that the Reformer model with a BERT tokenizer achieved the highest accuracy on both datasets, while the Reformer model with a BPE tokenizer had low word and sentence error rates and performed better on longer input sequences. The GCN-Transformer and GCNN-Transformer models also performed well, with the GCNN-Transformer outperforming its counterpart and the RNN implementations. We observed that although the BERT model had the advantage of pre-training, the Reformer model remained competitive, reaching 96% accuracy without pre-training. These findings highlight the significance of precise TTS normalization models for natural language generation and human-computer interaction. Our study contributes to the ongoing effort to enhance TTS normalization models.
Recommended Citation
Dholakia, Pankti, "Comparative Analysis of Transformer-Based Models for Text-To-Speech Normalization" (2023). Master's Projects. 1242.
DOI: https://doi.org/10.31979/etd.5dd6-k38w
https://scholarworks.sjsu.edu/etd_projects/1242