Publication Date

Spring 2023

Degree Type

Master's Project

Degree Name

Master of Science (MS)

Department

Computer Science

First Advisor

Ching-Seh Wu

Second Advisor

Genya Ishigaki

Third Advisor

Fabio Di Troia

Keywords

Natural Language Processing, Text-to-Speech, Normalization, Machine Learning, Transformer, Language Models, Pre-trained BERT, Human-Computer Interaction

Abstract

Text-to-Speech (TTS) normalization is an essential component of natural language processing (NLP) and plays a crucial role in producing natural-sounding synthesized speech. The normalization procedure has limitations, however: lengthy input sequences and variations in spoken language can present difficulties. This research addresses these challenges by evaluating and comparing the performance of several models to determine how effectively they handle language variations. The models include LSTM-GRU, Transformer, GCN-Transformer, GCNN-Transformer, Reformer, and a pre-trained BERT language model. Performance is evaluated using several metrics: accuracy, loss, word error rate, and sentence error rate. Google's TTS Wikipedia dataset was used for the primary experiments. To evaluate the efficacy of TTS normalization models on inconsistent language, such as slang, this research also produces a relatively small, manually annotated Twitter dataset, which provides an additional evaluation of the models and offers further insight into their effectiveness in handling variations in language. The results show that the Reformer model with the BERT tokenizer achieved the highest accuracy on both datasets, while the Reformer model with the BPE tokenizer had low word and sentence error rates and performed better on longer input sequences. The GCN-Transformer and GCNN-Transformer models also performed well, with the GCNN-Transformer outperforming both its GCN counterpart and the RNN implementations. We observed that although the BERT model had the advantage of pre-training, the Reformer model could compete with an accuracy of 96% without pre-trained data.
These findings highlight the significance of precise TTS normalization models for natural language generation and human-computer interaction. Our study contributes to the ongoing effort to enhance TTS normalization models.