Publication Date
Spring 2025
Degree Type
Master's Project
Degree Name
Master of Science in Computer Science (MSCS)
Department
Computer Science
First Advisor
Chris Pollett
Second Advisor
Robert Chun
Third Advisor
Thomas Austin
Keywords
Image Captioning, Assistive Technology, VisionMate, Accessibility, FastAPI, React.js, Hugging Face, GIT-base, Text-to-Speech, Web App, Real-Time Captioning, Frontend Development, Backend Integration, Vercel and Render
Abstract
VisionMate is a web application that generates captions for camera-captured images. It is designed
to assist users with visual impairments by converting visual input into spoken and written text. The
application uses the GIT-base model from Hugging Face, which processes the image and returns a
descriptive caption. Users can capture a picture with the device camera, either via the webcam on
desktop or the native camera interface on mobile. The app reads each caption aloud using the
SpeechSynthesis API and uses a full-screen tap target to keep interaction simple and accessible.
The frontend is implemented in React.js, and the backend is built with FastAPI. The backend calls
Hugging Face’s Inference API to perform model inference without loading large models locally,
reducing memory usage during deployment. On average, captions are generated in 5 to 8 seconds.
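The backend flow described above (forwarding the captured image to Hugging Face's hosted Inference API so that no model weights are loaded locally) could be sketched roughly as follows. This is a minimal illustration, not code from the report: the model identifier `microsoft/git-base-coco`, the helper names, and the response parsing are assumptions based on the public Inference API conventions.

```python
# Sketch of the caption step: POST the captured image bytes to the
# Hugging Face hosted Inference API and read back the generated caption.
# Model id and response shape are assumptions, not from the report.
import json
import urllib.request

HF_MODEL_ID = "microsoft/git-base-coco"  # assumed GIT-base checkpoint


def build_inference_request(model_id: str, token: str):
    """Return (url, headers) for a hosted Inference API image-to-text call."""
    url = f"https://api-inference.huggingface.co/models/{model_id}"
    headers = {"Authorization": f"Bearer {token}"}
    return url, headers


def caption_image(image_bytes: bytes, token: str) -> str:
    """Send raw image bytes to the Inference API and return the caption."""
    url, headers = build_inference_request(HF_MODEL_ID, token)
    req = urllib.request.Request(url, data=image_bytes, headers=headers)
    with urllib.request.urlopen(req) as resp:
        # Image-to-text endpoints return a list like
        # [{"generated_text": "a dog sitting on a couch"}].
        return json.loads(resp.read())[0]["generated_text"]
```

In a FastAPI endpoint, `caption_image` would be called with the bytes of the uploaded file and a token read from the environment; the 5 to 8 second latency reported above would then be dominated by this remote inference call rather than local computation.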
The GIT-base model was selected after comparative testing against BLIP-base, BLIP-large, and
GIT-large. Testing was conducted in the Chrome, Safari, and Firefox browsers on devices including
a MacBook Pro (M1) and an iPhone 15.
This report outlines the system architecture, the model comparisons, deployment on Vercel (frontend)
and Render (backend), and an evaluation of performance in terms of caption speed, model accuracy,
and device and browser compatibility.
Recommended Citation
Kokku, Sai Anoushka, "VisionMate: AI-Powered Image Captioning Web Application" (2025). Master's Projects. 1480.
DOI: https://doi.org/10.31979/etd.zg52-ukcj
https://scholarworks.sjsu.edu/etd_projects/1480