Publication Date

Fall 2025

Degree Type

Thesis

Degree Name

Master of Science (MS)

Department

Computer Engineering

Advisor

Jun Liu; Mahima Agumbe Suresh; Wencen Wu

Abstract

Vision-Language Models (VLMs) have emerged as transformative technologies for multimodal AI, yet they face significant hurdles in processing text-rich images required for enterprise applications like document understanding, medical imaging, and industrial inspection. Current VLMs struggle with accurate text extraction and reasoning, often exhibiting high hallucination rates and poor Optical Character Recognition (OCR) token utilization. To address these limitations, this research presents a comprehensive framework for optimizing parameter-efficient Low-Rank Adaptation (LoRA) fine-tuning strategies on state-of-the-art architectures, including LLaVA-1.5 and BLIVA-FlanT5. Our methodology integrates enhanced OCR token utilization, faithful caption generation, and specific hallucination mitigation techniques. We employ a multi-dimensional evaluation protocol encompassing traditional metrics (BLEU-4, ROUGE-L, CIDEr), hallucination assessments via CHAIR frameworks, and novel OCR effectiveness measures such as Unanswerable Answer Token Rate analysis to systematically compare baselines and reranking strategies across TextVQA and image captioning benchmarks. Experimental validation demonstrates that our approach yields substantial improvements in text-rich image understanding, establishing that BLIVA-FlanT5 architectures achieve superior performance over LLaVA-1.5 baselines while maintaining better control over hallucinations. The effective application of LoRA fine-tuning, combined with enhanced OCR token integration, significantly boosts TextVQA accuracy and grounding metrics, while faithful caption generation approaches improve semantic coherence. These contributions provide empirically validated benchmarks for parameter-efficient VLM adaptation and offer a scalable, practical solution for industries requiring high-accuracy visual document processing.
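For context, the parameter-efficient LoRA technique the abstract builds on freezes the pretrained weight matrix and learns only a low-rank additive update. A minimal sketch in plain NumPy (the dimensions, variable names, and scaling here are illustrative assumptions, not the thesis's actual LLaVA-1.5 or BLIVA-FlanT5 configuration):

```python
import numpy as np

def lora_forward(x, W, A, B, alpha=16, r=4):
    """Linear layer with a LoRA adapter: y = x @ (W + (alpha/r) * A @ B).

    W is the frozen pretrained weight (d_in x d_out); A (d_in x r) and
    B (r x d_out) are the small trainable factors, so only
    r * (d_in + d_out) parameters are updated instead of d_in * d_out.
    """
    delta = (alpha / r) * (A @ B)  # low-rank weight update, rank <= r
    return x @ (W + delta)

rng = np.random.default_rng(0)
d_in, d_out, r = 64, 32, 4        # illustrative sizes only
W = rng.normal(size=(d_in, d_out))
A = np.zeros((d_in, r))           # one factor zero-initialized, so the
B = rng.normal(size=(r, d_out))   # adapter starts as a no-op
x = rng.normal(size=(1, d_in))

# Before any training, the adapted layer matches the frozen base layer.
assert np.allclose(lora_forward(x, W, A, B), x @ W)
```

With r = 4 here, the adapter trains 4 * (64 + 32) = 384 parameters against the 2,048 in W, which is the efficiency argument the abstract's fine-tuning strategy relies on.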

Available for download on Saturday, August 15, 2026
