Publication Date
Fall 2025
Degree Type
Thesis
Degree Name
Master of Science (MS)
Department
Computer Engineering
Advisor
Bernardo Flores; Magdalini Eirinaki; Mahima Agumbe Suresh
Abstract
Text summarization models have improved significantly in recent years, driven by major advances in Large Language Model (LLM) technology. LLMs are widely used in news distribution (TL;DR news), translation tools (DeepL Translate), and virtual assistants (ChatGPT, DeepSeek, Claude, etc.). However, this progress has not reached all languages equally. Vietnamese is spoken by more than 90 million people, yet it remains underdeveloped for LLMs: it has many homophones, lexical tones that change the meaning of words, and grammar that differs markedly from higher-resource languages such as English and Spanish. Resources for Vietnamese are also limited. As a result, particularly in the context of LLMs, Vietnamese applications lag behind those of other languages. This gap disadvantages the Vietnamese monolingual population around the world, limiting their access to today's information society. This thesis addresses this gap through a comparative study that fine-tunes and evaluates two state-of-the-art Vietnamese-specific models—BartPho-syllable (BART-based) and ViT5 (T5-based)—for the task of abstractive news summarization. The models were trained using a parameter-efficient (QLoRA) approach on a curated corpus combining the public nam194/vietnews dataset with freshly scraped articles from major Vietnamese news outlets. The key finding of this research is that both fine-tuned models were ultimately outperformed on standard quantitative metrics (ROUGE and BERTScore) by simpler extractive baselines, particularly the Lead-3 heuristic. The evaluation scores and final validation loss showed that, of the two fine-tuned models, ViT5 achieved better results than BartPho-syllable.
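The Lead-3 heuristic referenced in the abstract simply returns the first three sentences of an article as its summary, exploiting the fact that news articles front-load key information. A minimal sketch is below; the naive regex sentence splitter is an illustrative assumption, not the thesis's actual preprocessing pipeline:

```python
import re

def lead_k(article: str, k: int = 3) -> str:
    """Lead-k extractive baseline: the first k sentences of the article.

    Sentences are split with a simple regex on terminal punctuation,
    which is an approximation; production pipelines typically use a
    proper sentence segmenter.
    """
    sentences = re.split(r"(?<=[.!?])\s+", article.strip())
    return " ".join(sentences[:k])

article = "Câu một. Câu hai. Câu ba. Câu bốn."
print(lead_k(article))  # first three sentences only
```

Despite its simplicity, such a baseline is a standard point of comparison in news summarization, which is what makes the abstract's finding notable.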
Recommended Citation
Pho, Tin, "Developing a Vietnamese Text Summarization Large Language Model on Limited Hardware" (2025). Master's Theses. 5747.
DOI: https://doi.org/10.31979/etd.4nta-vb32
https://scholarworks.sjsu.edu/etd_theses/5747