Publication Date

Fall 2025

Degree Type

Thesis

Degree Name

Master of Science (MS)

Department

Applied Data Science

Advisor

Mohammad Masum; Saptarshi Sengupta; Vishnu Pendyala

Abstract

Recent progress in large language models (LLMs) enables advanced multimodal understanding, but their high computational cost necessitates monetization strategies such as interactive advertising. Bounding boxes show promise for this purpose, but they can lack precision and visual appeal. Image segmentation offers a more precise alternative but faces a dual problem: traditional models demand scarce, costly training data, and open-vocabulary segmentation models like SAM are class-agnostic and cannot semantically identify a "consumer product" object class. In this research, we address these limitations by: 1) developing the Prompt-Guided Inpainting Framework (PGIF), which injects negative prompts to generate robustly annotated synthetic segmented product images; 2) investigating the Class-Agnostic Retail Segmentation and Verification Engine (CARVE), a novel open-vocabulary framework that segments unseen products without retraining; and 3) exploring an agentic AI framework in which an LLM orchestrates specialized computer vision tools. The results support all three approaches. Augmenting a limited real dataset with synthetic images improved model performance by up to 0.153 mAP@50. The CARVE framework achieved higher mask quality (0.9325 mIoU) and recall (0.9491) than the baselines. Finally, the agentic framework reduced inference time by 2.1 seconds and cost by 72.6% compared to a zero-shot baseline, while delivering improved segmentation. This work advances the accessibility and practical application of image segmentation in settings where efficiency and cost are key.
