Publication Date

Fall 2025

Degree Type

Thesis

Degree Name

Master of Science (MS)

Department

Computer Engineering

Advisor

Jun Liu; Mahima Agumbe Suresh; Wencen Wu

Abstract

Vision-Language Models (VLMs) have emerged as transformative technologies for multimodal AI, yet they face significant hurdles in processing text-rich images required for enterprise applications like document understanding, medical imaging, and industrial inspection. Current VLMs struggle with accurate text extraction and reasoning, often exhibiting high hallucination rates and poor Optical Character Recognition (OCR) token utilization. To address these limitations, this research presents a comprehensive framework for optimizing parameter-efficient Low-Rank Adaptation (LoRA) fine-tuning strategies on state-of-the-art architectures, including LLaVA-1.5 and BLIVA-FlanT5. Our methodology integrates enhanced OCR token utilization, faithful caption generation, and specific hallucination mitigation techniques. We employ a multi-dimensional evaluation protocol encompassing traditional metrics (BLEU-4, ROUGE-L, CIDEr), hallucination assessments via CHAIR frameworks, and novel OCR effectiveness measures such as Unanswerable Answer Token Rate analysis to systematically compare baselines and reranking strategies across TextVQA and image captioning benchmarks. Experimental validation demonstrates that our approach yields substantial improvements in text-rich image understanding, establishing that BLIVA-FlanT5 architectures achieve superior performance over LLaVA-1.5 baselines while maintaining better control over hallucinations. The effective application of LoRA fine-tuning, combined with enhanced OCR token integration, significantly boosts TextVQA accuracy and grounding metrics, while faithful caption generation approaches improve semantic coherence. These contributions provide empirically validated benchmarks for parameter-efficient VLM adaptation and offer a scalable, practical solution for industries requiring high-accuracy visual document processing.
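For context, the parameter-efficient LoRA technique the abstract builds on freezes the pretrained weight matrix and learns only a low-rank additive update. A minimal sketch in plain NumPy (the dimensions, variable names, and scaling here are illustrative assumptions, not the thesis's actual LLaVA-1.5 or BLIVA-FlanT5 configuration):

```python
import numpy as np

def lora_forward(x, W, A, B, alpha=16, r=4):
    """Linear layer with a LoRA adapter: y = x @ (W + (alpha/r) * A @ B).

    W is the frozen pretrained weight (d_in x d_out); A (d_in x r) and
    B (r x d_out) are the small trainable factors, so only
    r * (d_in + d_out) parameters are updated instead of d_in * d_out.
    """
    delta = (alpha / r) * (A @ B)  # low-rank weight update, rank <= r
    return x @ (W + delta)

rng = np.random.default_rng(0)
d_in, d_out, r = 64, 32, 4        # illustrative sizes only
W = rng.normal(size=(d_in, d_out))
A = np.zeros((d_in, r))           # one factor zero-initialized, so the
B = rng.normal(size=(r, d_out))   # adapter starts as a no-op
x = rng.normal(size=(1, d_in))

# Before any training, the adapted layer matches the frozen base layer.
assert np.allclose(lora_forward(x, W, A, B), x @ W)
```

With r = 4 here, the adapter trains 4 * (64 + 32) = 384 parameters against the 2,048 in W, which is the efficiency argument the abstract's fine-tuning strategy relies on.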

Available for download on Saturday, August 15, 2026
