Publication Date
Spring 2025
Degree Type
Master's Project
Degree Name
Master of Science in Computer Science (MSCS)
Department
Computer Science
First Advisor
Faranak Abri
Second Advisor
Sayma Akther
Third Advisor
Zhen Fan
Keywords
Image Captioning, Deep Learning, Neural Network, Transformer
Abstract
Image captioning, which produces a textual description of visual content, is fundamental to the advancement of human-AI interaction technology. To explore applications of this technology, this project pursues two goals. The first is to apply image information-retrieval capabilities directly, and the second is to examine in detail the pipeline and components of image captioning models. The project presents a working app that uses text retrieval to support image storage with features such as tagging and transcription. The app also supports search with a clear usage flow, and its model deployment facilitates data security. In addition, the project explores two trained models: one based on ResNet50 and an LSTM with an adaptive attention mechanism, and the other a Q-Former mid-layer adapter with a pretrained, frozen ViT and GPT-2.
Recommended Citation
Fan, Zixiao, "Image-to-Text Transcription: Analyzing and Describing Visual Content" (2025). Master's Projects. 1532.
DOI: https://doi.org/10.31979/etd.g4e2-2eha
https://scholarworks.sjsu.edu/etd_projects/1532