There has been immense progress in recent years in the fields of computer vision, object detection, and natural language processing (NLP). Artificial Intelligence (AI) systems such as question answering models use NLP to give a machine a reading-comprehension capability: the machine can answer natural language queries about any portion of an unstructured text. An extension of this idea combines NLP with computer vision to perform Visual Question Answering (VQA), the task of building a system that can answer natural language queries about images. A number of VQA systems based on deep-learning architectures and learning algorithms have been proposed. This project introduces a VQA system that gains understanding of images by extracting image features with a deep convolutional neural network (CNN); more specifically, the feature embeddings from the output layer of the VGG19 model are used for this purpose. To interpret a query correctly and return an appropriate answer, the system must also reason over and understand natural language, so the InferSent model is used to obtain sentence-level embeddings as features of the question. Different architectures are proposed for merging the image and language models. Our system achieves results comparable to the baseline systems on the VQA dataset.
Kansara, Pankti, "Visual Question Answering" (2018). Master's Projects. 640.
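The pipeline described above (a VGG19 image embedding and an InferSent question embedding, merged and passed to an answer classifier) can be sketched roughly as follows. This is an illustrative sketch only: the feature vectors are random stand-ins (VGG19 fully connected layers and InferSent embeddings are both 4096-dimensional), and fusing by elementwise product followed by a single linear softmax layer is just one common merge architecture, not necessarily the exact model the project uses.

```python
import numpy as np

rng = np.random.default_rng(0)

# Dimensions: VGG19's fc layers and InferSent both produce 4096-d vectors.
IMG_DIM = 4096      # stand-in for VGG19 output-layer image features
TXT_DIM = 4096      # stand-in for the InferSent sentence embedding
NUM_ANSWERS = 1000  # VQA setups often classify over the top-K frequent answers

def fuse_and_classify(img_feat, q_feat, W, b):
    """Merge image and question features and score candidate answers.

    Fusion here is an elementwise product (one common merge strategy);
    the classifier is a single linear layer followed by a softmax.
    """
    fused = img_feat * q_feat            # (4096,) joint embedding
    logits = W @ fused + b               # (NUM_ANSWERS,) answer scores
    exp = np.exp(logits - logits.max())  # numerically stable softmax
    return exp / exp.sum()

# Random stand-ins for real VGG19 / InferSent features and trained weights.
img_feat = rng.standard_normal(IMG_DIM)
q_feat = rng.standard_normal(TXT_DIM)
W = rng.standard_normal((NUM_ANSWERS, IMG_DIM)) * 0.01
b = np.zeros(NUM_ANSWERS)

probs = fuse_and_classify(img_feat, q_feat, W, b)
answer_idx = int(np.argmax(probs))  # index of the predicted answer
```

In a trained system the two feature extractors would be real pretrained networks and `W`, `b` would be learned end-to-end; alternative merge architectures (e.g. concatenation followed by fully connected layers) simply replace the elementwise product here.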