VisioRAG: A Multimodal Framework for Enhancing Recommendation Systems Using Vision Transformers and RAG
Publication Date
1-1-2025
Document Type
Conference Proceeding
Publication Title
Proceedings 2025 IEEE Conference on Artificial Intelligence, CAI 2025
DOI
10.1109/CAI64502.2025.00025
First Page
114
Last Page
119
Abstract
In the rapidly advancing field of visual search, traditional methods that rely solely on visual features often struggle with accuracy and relevance, especially in e-commerce, where precise recommendations are crucial. Issues such as keyword stuffing in product descriptions compound these challenges. To overcome these limitations, we present VisioRAG, a multimodal recommendation framework that integrates visual and textual features. VisioRAG uses Vision Transformers (ViT) to categorize the input query and Retrieval-Augmented Generation (RAG) with large language models (LLMs) for image captioning and query enhancement, converting image queries into contextual word embeddings. These embeddings, together with the original image queries, form the multimodal input to the framework. The system uses Florence-2-large for image captioning, BERT for contextual embedding generation, and Google's Gemini for caption enhancement via prompt engineering. An early fusion technique merges the visual and textual vectors, and recommendations are ranked by cosine similarity so that product matches align with user intent and preferences. Our evaluation on Amazon product data across five categories shows that the fusion approach with RAG achieves the highest precision (0.9333 ± 0.1294), surpassing the other methods evaluated. This demonstrates VisioRAG's potential to improve product recommendations and customer satisfaction by leveraging generative AI for human-like text generation.
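The early fusion and cosine-similarity ranking described in the abstract can be illustrated with a minimal sketch. The abstract does not specify how the vectors are fused, so this example assumes fusion by concatenating L2-normalized visual and textual vectors. The function names (`l2_normalize`, `early_fusion`, `cosine_similarity`, `recommend`), the product IDs, and the toy vectors are all hypothetical; in the actual framework the vectors would come from the image and text encoders rather than hand-written lists.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return list(vec)
    return [x / norm for x in vec]


def early_fusion(visual_vec, text_vec):
    """Assumed fusion scheme: concatenate the normalized visual and text vectors."""
    return l2_normalize(visual_vec) + l2_normalize(text_vec)


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def recommend(query_vec, catalog, top_k=3):
    """Rank catalog items by cosine similarity to the fused query vector."""
    scored = [(pid, cosine_similarity(query_vec, vec)) for pid, vec in catalog.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy data standing in for encoder outputs (hypothetical values).
query = early_fusion([0.9, 0.1, 0.0], [0.2, 0.8])
catalog = {
    "product_a": early_fusion([0.8, 0.2, 0.0], [0.1, 0.9]),
    "product_b": early_fusion([0.0, 0.1, 0.9], [0.9, 0.1]),
    "product_c": early_fusion([0.9, 0.0, 0.1], [0.8, 0.2]),
}

top = recommend(query, catalog, top_k=2)
for product_id, score in top:
    print(product_id, round(score, 3))
```

Normalizing each modality before concatenation keeps either vector from dominating the similarity score purely because of its magnitude; other fusion schemes (for example, weighted sums or learned projections) are equally plausible readings of "early fusion".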
Keywords
E-commerce Recommendation, Image Captioning, Large Language Models (LLMs), Retrieval-Augmented Generation, Vision Transformers
Department
Applied Data Science
Recommended Citation
Anitha Balachandran and Mohammad Masum. "VisioRAG: A Multimodal Framework for Enhancing Recommendation Systems Using Vision Transformers and RAG." Proceedings 2025 IEEE Conference on Artificial Intelligence, CAI 2025 (2025): 114-119. https://doi.org/10.1109/CAI64502.2025.00025