A Comparative Study of Retrieval-Augmented Generation (RAG) Chatbots

Kalindi Vijesh Parekh, San Jose State University
Navrati Saxena, San Jose State University
Mohammad Adil Ansari

Abstract

The use of Retrieval-Augmented Generation (RAG) in chatbot platforms has transformed academic spaces by significantly improving information accessibility. RAG has become a viable approach to augmenting Large Language Models (LLMs) with real-time access to external knowledge. With the increasing availability of advanced LLMs such as GPT, DeepSeek, Claude, Gemini, and Llama, there is a growing need to compare RAG systems built on different LLMs. This study compares the responses of four RAG chatbots, each built on a popular LLM, using a purpose-built evaluation dataset. Specifically, it compares the responses and performance of closed-source models (GPT-4o and Claude) and open-source models (DeepSeek and Llama) on questions requiring inference over multiple scientific papers with intricate content and structure. All four RAG pipelines use a Chroma vector database to store document embeddings. The retrieved documents and the query are provided together as the input prompt to each LLM, enabling contextually grounded responses. Each chatbot is evaluated on ten complex research papers drawn from various domains in computer science. The evaluation dataset contains 75 questions derived from these papers, ranging from simple yes/no questions to questions requiring an understanding of multiple papers. The responses of each chatbot are scored quantitatively using standard metrics, namely Bilingual Evaluation Understudy (BLEU), Recall-Oriented Understudy for Gisting Evaluation (ROUGE), and BERTScore, which builds on Bidirectional Encoder Representations from Transformers (BERT), to evaluate response quality comprehensively.
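The retrieve-then-prompt pipeline described in the abstract can be illustrated with a minimal sketch, assuming the `chromadb` and `openai` Python clients; the collection name, document chunks, model choice, and prompt wording below are illustrative placeholders, not the study's actual configuration.

```python
# Minimal RAG sketch: Chroma for similarity search, then the retrieved
# chunks plus the query are passed as the LLM's input prompt.
# Assumptions: `chromadb` and `openai` packages; placeholder names throughout.
import chromadb
from openai import OpenAI

chroma_client = chromadb.Client()
collection = chroma_client.get_or_create_collection("papers")  # hypothetical name

# Index paper chunks; Chroma embeds documents with its default
# embedding function unless one is supplied.
collection.add(
    ids=["chunk-1", "chunk-2"],
    documents=[
        "RAG augments LLMs with context retrieved at query time.",
        "Chroma stores vector embeddings for similarity search.",
    ],
)

def answer(query: str, k: int = 3) -> str:
    # Retrieve the k chunks most similar to the query.
    hits = collection.query(query_texts=[query], n_results=k)
    context = "\n\n".join(hits["documents"][0])
    # Provide the retrieved documents and the query together as the prompt,
    # so the response is grounded in the retrieved context.
    llm = OpenAI()
    resp = llm.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "Answer using only the provided context."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {query}"},
        ],
    )
    return resp.choices[0].message.content

print(answer("How does RAG ground LLM responses?"))
```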
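The quantitative scoring step can likewise be sketched, assuming the Hugging Face `evaluate` package for BLEU and ROUGE and the `bert_score` package for BERTScore; the prediction and reference strings are placeholders, not drawn from the evaluation dataset.

```python
# Sketch of scoring a chatbot response against a reference answer with
# BLEU, ROUGE, and BERTScore. Assumptions: `evaluate` and `bert_score`
# packages; placeholder prediction/reference strings.
import evaluate
from bert_score import score as bert_score

prediction = ["RAG grounds LLM answers in retrieved documents."]
reference = ["RAG grounds responses in documents retrieved at query time."]

# BLEU expects one list of reference strings per prediction.
bleu = evaluate.load("bleu").compute(predictions=prediction, references=[reference])
rouge = evaluate.load("rouge").compute(predictions=prediction, references=reference)
# BERTScore returns precision, recall, and F1 tensors.
P, R, F1 = bert_score(prediction, reference, lang="en")

print(bleu["bleu"], rouge["rougeL"], F1.mean().item())
```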