Publication Date

Spring 2018

Degree Type

Master's Project

Degree Name

Master of Science (MS)


Computer Science


Computer Vision is a scientific discipline which involves the development of an algorithmic basis for the construction of intelligent systems that aim at analysis, understanding and extraction of useful information from visual data. This visual data can be plain images, video sequences, views from multiple cameras, etc. Natural Language Processing (NLP), is the ability of machines to read and understand human languages. Visual Question Answering (VQA), is a multi-discipline Artificial Intelligence (AI) research problem, which is a combination of Natural Language Processing (NLP), Computer Vision (CV), and Knowledge Reasoning (KR). Given an image and a question related to the image in natural language, the algorithm has to output an accurate natural language answer. Since the questions are open-ended, the system requires a very detailed understanding of the image, its context and a broad set of AI capabilities – object detection, activity recognition and knowledge-based reasoning. Since the release of the VQA dataset in 2014, numerous datasets and algorithms for VQA have been put forward. In this work, we propose a new baseline for the problem of visual question answering. Our model uses a deep residual network (ResNet) to compute the image features and ByteNet to compute question embeddings. A soft attention mechanism is used to focus on most relevant image features and a classifier is used to generate probabilities over an answer set. We implemented the solution in TensorFlow, which is an open source deep-learning platform, developed by Google. iv Prior to using deep residual network (ResNet) and ByteNet, we tried using VGG16 for extracting image features and long short-term memory units (LSTM) for extracting question features. We observed that using ResNet and ByteNet resulted in an improved accuracy when compared to using VGG16 and LSTM. We evaluate our model on three major image question answering datasets: DAQUAR-ALL, COCO-QA and The VQA Dataset. Our model, despite having a relatively simple architecture, achieves 64.6% accuracy on VQA 1.0 dataset and 59.7% accuracy on VQA 2.0 dataset.