Publication Date

Spring 2025

Degree Type

Master's Project

Degree Name

Master of Science in Computer Science (MSCS)

Department

Computer Science

First Advisor

Chris Pollett

Second Advisor

Robert Chun

Third Advisor

Thomas Austin

Keywords

Image Captioning, Assistive Technology, VisionMate, Accessibility, FastAPI, React.js, Hugging Face, GIT-base, Text-to-Speech, Web App, Real-Time Captioning, Frontend Development, Backend Integration, Vercel, Render

Abstract

VisionMate is a web application that generates captions for camera-captured images. It is designed
to assist users with visual impairments by converting visual input into spoken and written text. The
application uses the GIT-base model from Hugging Face, which processes the image and returns a
descriptive caption. Users can capture a picture with the device camera, via the webcam on
desktop or the native camera interface on mobile. The app speaks each caption aloud using the
SpeechSynthesis API and uses a full-screen tap target to keep interaction simple for users with limited vision.

The frontend is implemented in React.js, and the backend is built with FastAPI. The backend calls
Hugging Face’s Inference API to perform model inference without loading large models locally,
reducing memory usage during deployment. On average, captions are generated in 5 to 8 seconds.
The GIT-base model was selected after comparative testing against BLIP-base, BLIP-large, and GIT-large.
Testing was conducted on the Chrome, Safari, and Firefox browsers, using devices such as the
MacBook Pro (M1) and iPhone 15.
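The backend flow described above, where FastAPI forwards the captured image to Hugging Face's hosted Inference API instead of loading the model locally, can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the model identifier `microsoft/git-base`, the endpoint URL, and the token handling are assumptions, and the FastAPI route wiring is omitted.

```python
import json
import urllib.request

# Assumed Hugging Face hosted-inference endpoint for the GIT-base model.
HF_API_URL = "https://api-inference.huggingface.co/models/microsoft/git-base"


def build_request(image_bytes: bytes, token: str) -> urllib.request.Request:
    """Assemble the POST request: raw image bytes as the body,
    with bearer-token authentication in the header."""
    return urllib.request.Request(
        HF_API_URL,
        data=image_bytes,
        headers={"Authorization": f"Bearer {token}"},
        method="POST",
    )


def caption_image(image_bytes: bytes, token: str) -> str:
    """Send the image to the hosted model and return its caption.

    For image-to-text models, the Inference API returns JSON shaped
    like [{"generated_text": "..."}].
    """
    req = build_request(image_bytes, token)
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.loads(resp.read())[0]["generated_text"]
```

Because inference runs on Hugging Face's servers, the deployed backend only pays the cost of an HTTP round trip, which is consistent with the 5 to 8 second caption latency and the reduced memory footprint reported above.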

This report outlines the system architecture, model comparisons, deployment on Vercel (frontend)
and Render (backend), and evaluation of performance across speed, model accuracy, and device and
browser compatibility.

Available for download on Wednesday, May 20, 2026
