VisioRAG: A Multimodal Framework for Enhancing Recommendation Systems Using Vision Transformers and RAG

Publication Date

1-1-2025

Document Type

Conference Proceeding

Publication Title

Proceedings of the 2025 IEEE Conference on Artificial Intelligence (CAI 2025)

DOI

10.1109/CAI64502.2025.00025

First Page

114

Last Page

119

Abstract

In the rapidly advancing field of visual search technology, traditional methods that rely solely on visual features often struggle with accuracy and relevance, especially in e-commerce, where precise recommendations are crucial. Issues like keyword stuffing in product descriptions further compound these challenges. To overcome these limitations, we present VisioRAG, a multimodal recommendation framework that integrates visual and textual features. By utilizing Vision Transformers (ViT) for input query categorization and Retrieval-Augmented Generation (RAG) with large language models (LLMs) for image captioning and query enhancement, VisioRAG converts image queries into contextual word embeddings. These embeddings, along with the original image queries, form the multimodal input to the framework. The system leverages Florence-2-large for image captioning, BERT for contextual embedding generation, and Google's Gemini for caption enhancement via prompt engineering. An early fusion technique effectively merges the visual and textual vectors. Recommendations are made using cosine similarity, ensuring product matches that align with user intent and preferences. Our evaluation using Amazon product data across five categories shows that the fusion approach with RAG achieves the highest precision (0.9333 ± 0.1294), surpassing other methods. This demonstrates VisioRAG's potential to improve product recommendations and customer satisfaction by leveraging generative AI for human-like text generation.
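The abstract describes early fusion of visual and textual embedding vectors followed by cosine-similarity ranking, but does not give the fusion details. The sketch below is a minimal illustration of that retrieval step, assuming L2-normalized concatenation as the fusion strategy and toy vectors in place of the ViT/BERT embeddings; the function and variable names are hypothetical, not from the paper.

```python
import numpy as np

def early_fusion(image_vec, text_vec, alpha=0.5):
    """Fuse visual and textual embeddings into one vector.

    Assumption: each modality is L2-normalized, weighted, and
    concatenated; the paper's exact fusion may differ.
    """
    iv = image_vec / np.linalg.norm(image_vec)
    tv = text_vec / np.linalg.norm(text_vec)
    return np.concatenate([alpha * iv, (1.0 - alpha) * tv])

def cosine_similarity(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def recommend(query_fused, catalog_fused, top_k=3):
    """Rank catalog items by cosine similarity to the fused query vector."""
    scores = [(pid, cosine_similarity(query_fused, vec))
              for pid, vec in catalog_fused.items()]
    return sorted(scores, key=lambda s: s[1], reverse=True)[:top_k]

# Toy usage: 2-D stand-ins for the image and caption embeddings.
query = early_fusion(np.array([1.0, 0.0]), np.array([0.0, 1.0]))
catalog = {
    "product_A": early_fusion(np.array([1.0, 0.0]), np.array([0.0, 1.0])),
    "product_B": early_fusion(np.array([0.0, 1.0]), np.array([1.0, 0.0])),
}
print(recommend(query, catalog, top_k=1))
```

In practice the image vector would come from the ViT/Florence-2 pipeline and the text vector from BERT embeddings of the Gemini-enhanced caption, with far higher dimensionality than this toy example.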

Keywords

E-commerce Recommendation, Image Captioning, Large Language Models (LLMs), Retrieval-Augmented Generation, Vision Transformers

Department

Applied Data Science
