Publication Date

Spring 2023

Degree Type

Master's Project

Degree Name

Master of Science (MS)

Department

Computer Science

First Advisor

William Andreopoulos

Second Advisor

Fabio Di Troia

Third Advisor

Nada Attar

Keywords

Image Captioning, RL (Reinforcement Learning), CNN (Convolutional Neural Network), RNN (Recurrent Neural Network)

Abstract

Image captioning is a crucial technology with numerous applications, including enhancing accessibility for the visually impaired, building automated image indexing and retrieval systems, and enriching social media experiences. However, accurately describing the content of an image in natural language remains challenging, particularly in low-resource settings where data and computational power are limited. The most advanced image captioning architectures currently use encoder-decoder structures built around a sequential recurrent prediction model. This study adopts a typical Convolutional Neural Network (CNN) encoder and Recurrent Neural Network (RNN) decoder structure for image captioning, but frames the problem as a sequential decision-making task. The captioning models in this research were trained with reinforcement learning (RL) to improve performance. The study uses a policy network to predict the next word in a caption from the previously generated words, and a value network to evaluate the entire caption and its possible continuations. Both networks are trained with a reinforcement learning procedure that relies on visual-semantic embeddings. This method outperforms the standard encoder-decoder framework even with minimal training on a smaller subset of the Microsoft COCO dataset.
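The policy/value decoding the abstract describes can be illustrated with a minimal sketch. Everything below is hypothetical: the toy vocabulary, random weight matrices, the `step` state update, and the mixing weight `beta` are stand-ins for the trained CNN-RNN policy network, the value network, and the RNN decoder step; only the scoring idea (log-probability from the policy combined with a value estimate of the resulting state) reflects the approach described.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy vocabulary; the real model uses the COCO caption vocabulary.
VOCAB = ["<start>", "a", "dog", "runs", "<end>"]
V = len(VOCAB)
D = 8  # hypothetical decoder-state dimension

# Random weights stand in for the trained policy and value networks.
W_policy = rng.normal(size=(D, V))
W_value = rng.normal(size=(D, 1))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def policy(state):
    """Distribution over the next word given the current decoder state."""
    return softmax(state @ W_policy)

def value(state):
    """Scalar estimate of how good the completed caption will be."""
    return float(state @ W_value)

def step(state, word_id):
    """Toy state update; the real model advances the RNN decoder one step."""
    return np.tanh(state + 0.1 * word_id)

def decode(state, beta=0.5, max_len=5):
    """Greedy decoding that mixes policy log-probability with value lookahead."""
    caption = []
    for _ in range(max_len):
        probs = policy(state)
        # Score each candidate word by its policy log-probability plus the
        # value network's estimate of the state that word leads to.
        scores = [
            beta * np.log(probs[w] + 1e-12) + (1 - beta) * value(step(state, w))
            for w in range(V)
        ]
        w = int(np.argmax(scores))
        if VOCAB[w] == "<end>":
            break
        caption.append(VOCAB[w])
        state = step(state, w)
    return caption

print(decode(rng.normal(size=D)))
```

In the actual system both networks are trained with RL against a visual-semantic embedding reward rather than fixed at random, and decoding typically uses a lookahead/beam procedure over the combined score instead of this one-step greedy loop.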
