Publication Date
Fall 2025
Degree Type
Thesis
Degree Name
Master of Science (MS)
Department
Computer Engineering
Advisor
Jun Liu; Mahima Agumbe Suresh; Wencen Wu
Abstract
Vision-Language Models (VLMs) have emerged as transformative technologies for multimodal AI, yet they face significant hurdles in processing the text-rich images required for enterprise applications such as document understanding, medical imaging, and industrial inspection. Current VLMs struggle with accurate text extraction and reasoning, often exhibiting high hallucination rates and poor utilization of Optical Character Recognition (OCR) tokens. To address these limitations, this research presents a comprehensive framework for optimizing parameter-efficient Low-Rank Adaptation (LoRA) fine-tuning strategies on state-of-the-art architectures, including LLaVA-1.5 and BLIVA-FlanT5. Our methodology integrates enhanced OCR token utilization, faithful caption generation, and targeted hallucination-mitigation techniques. We employ a multi-dimensional evaluation protocol encompassing traditional metrics (BLEU-4, ROUGE-L, CIDEr), hallucination assessment via the CHAIR framework, and novel OCR-effectiveness measures such as Unanswerable Answer Token Rate analysis to systematically compare baselines and reranking strategies across TextVQA and image captioning benchmarks. Experimental validation demonstrates that our approach yields substantial improvements in text-rich image understanding, establishing that BLIVA-FlanT5 architectures outperform LLaVA-1.5 baselines while maintaining better control over hallucinations. The effective application of LoRA fine-tuning, combined with enhanced OCR token integration, significantly boosts TextVQA accuracy and grounding metrics, while faithful caption generation improves semantic coherence. These contributions provide empirically validated benchmarks for parameter-efficient VLM adaptation and offer a scalable, practical solution for industries requiring high-accuracy visual document processing.
Recommended Citation
Malini, Karthik Ganesh, "Parameter-Efficient Multimodal Adaptation: OCR-Integrated LoRA for TextVQA and Captioning" (2025). Master's Theses. 5733.
DOI: https://doi.org/10.31979/etd.47dw-kbuk
https://scholarworks.sjsu.edu/etd_theses/5733