Publication Date

Fall 2025

Degree Type

Thesis - Campus Access Only

Degree Name

Master of Science (MS)

Department

Applied Data Science

Advisor

Mohammad Masum; Saptarshi Sengupta; Vishnu Pendyala

Abstract

Large Language Models (LLMs) are increasingly used for analyzing structured data, but most current methods, such as translating questions to SQL or prompting with the full table in natural language, still face practical issues. SQL-based systems tend to be rigid and error-prone when schemas vary, while full-table prompts are inefficient and often break down on large or complex tables. This work proposes Complexity-Aware Prompting (CAP), a framework that makes natural language-to-code generation more reliable for tabular data. The idea is to decompose each user question into a combination of fifteen basic Pandas operations and use that decomposition to retrieve examples that are similar not only in meaning but also in the kind of computation they require. These examples are then combined with reasoning-based prompting methods such as Chain-of-Thought and Tree-of-Thought to help the model reason through the task before generating code. On the DataBench benchmark, the proposed framework reached 92.00% executable accuracy with Claude-3.7-Sonnet, roughly 22% better than instruction-only prompts and roughly 15% higher than existing retrieval-based methods, and it reduced schema-related errors by around one-third. Overall, this approach offers a practical way to make LLMs more consistent and trustworthy for everyday data analysis tasks.

Keywords: natural language to code, tabular question answering, large language models, complexity-aware prompting, complexity alignment, executable accuracy
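
For illustration only, the sketch below shows one way complexity-aligned retrieval could be scored: candidate examples are ranked by a blend of semantic similarity between questions and overlap in the basic Pandas operations their gold code requires. All names here (BASIC_OPS, Example, retrieve_examples, the Jaccard/cosine weighting) are assumptions for this sketch, not the implementation described in the thesis.

    # Minimal sketch of complexity-aware example retrieval (assumed names, not the thesis code).
    from dataclasses import dataclass

    # Hypothetical inventory of basic Pandas operations a question may decompose into.
    BASIC_OPS = {
        "filter", "sort", "groupby", "aggregate", "join", "pivot", "count",
        "sum", "mean", "max", "min", "unique", "rank", "apply", "merge",
    }

    @dataclass
    class Example:
        question: str
        code: str
        ops: set[str]           # basic operations used by the example's Pandas code
        embedding: list[float]  # semantic embedding of the example question

    def op_overlap(a: set[str], b: set[str]) -> float:
        """Jaccard similarity over required operations (computational alignment)."""
        return len(a & b) / len(a | b) if (a | b) else 0.0

    def cosine(u: list[float], v: list[float]) -> float:
        """Cosine similarity over question embeddings (semantic alignment)."""
        dot = sum(x * y for x, y in zip(u, v))
        nu = sum(x * x for x in u) ** 0.5
        nv = sum(x * x for x in v) ** 0.5
        return dot / (nu * nv) if nu and nv else 0.0

    def retrieve_examples(query_emb, query_ops, pool, k=3, alpha=0.5):
        """Rank a pool of Example objects by blended semantic + complexity similarity."""
        scored = [
            (alpha * cosine(query_emb, ex.embedding)
             + (1 - alpha) * op_overlap(query_ops, ex.ops), ex)
            for ex in pool
        ]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [ex for _, ex in scored[:k]]

In such a setup, the top-k retrieved examples would then be placed into a Chain-of-Thought or Tree-of-Thought prompt ahead of the user's question before code generation.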
