Publication Date
Spring 2025
Degree Type
Master's Project
Degree Name
Master of Science in Computer Science (MSCS)
Department
Computer Science
First Advisor
Robert Chun
Second Advisor
Thomas Austin
Third Advisor
Sankalp Dwivedi
Keywords
Data Engineering, ETL (Extract, Transform, Load), LLMs (Large Language Models), Quantization, QLoRA (Quantized Low-Rank Adapter), Supervised Fine-Tuning
Abstract
This project discusses the use of Large Language Models (LLMs) to simplify and automate Extract, Transform, Load (ETL) development. Data engineering tasks often demand technical expertise, making such time-intensive operations challenging for non-experts and even industry professionals. The project tackles these challenges by quantizing a base Llama-2-7b-Chat model using the QLoRA technique. The lowered precision of the base model allows it to run on limited hardware resources. The project then applies the Supervised Fine-Tuning (SFT) trainer to specialize the model for generating scripts for ETL tasks. The fine-tuned model is evaluated on a series of micro and comprehensive end-to-end ETL tasks and is compared against human-written baseline scripts, demonstrating robust performance with respect to efficiency, accuracy, and response time. The fine-tuned model achieved 85% accuracy on transformation tasks such as schema mapping and data-cleaning operations like formatting dates and deriving columns from raw unstructured CSV files. The model is published on Hugging Face and, with 81 downloads to date, provides a reasonable solution for ETL automation. The model also lays the foundation for further enhancements in the field.
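The QLoRA-plus-SFT workflow summarized in the abstract can be sketched with the Hugging Face `transformers`, `peft`, and `trl` libraries. This is a minimal illustrative configuration: the hyperparameters, target modules, and the dataset file `etl_instructions.jsonl` are assumptions for illustration, not the values or data used in the report.

```python
# Illustrative sketch of QLoRA quantization + supervised fine-tuning (SFT).
# Hyperparameters and the dataset file are hypothetical, not the report's values.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig
from trl import SFTTrainer
from datasets import load_dataset

# 4-bit NF4 quantization (the "Q" in QLoRA): lowers the base model's
# precision so the 7B model fits on limited hardware.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# Base model named in the abstract, loaded in quantized form.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-chat-hf",
    quantization_config=bnb_config,
    device_map="auto",
)

# Low-rank adapters (the "LoRA" part): only these small matrices are trained,
# while the quantized base weights stay frozen.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # assumed attention projections
    task_type="CAUSAL_LM",
)

# Supervised fine-tuning on instruction/ETL-script pairs.
# "etl_instructions.jsonl" is a hypothetical training file.
dataset = load_dataset("json", data_files="etl_instructions.jsonl", split="train")
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=lora_config,
)
trainer.train()
```

Because only the low-rank adapter weights are updated, the memory footprint during fine-tuning is dominated by the 4-bit base model, which is what makes training a 7B model feasible on a single consumer GPU.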
Recommended Citation
Chandras, Aditya, "Large Language Model powered ETL Pipeline Development A Project Report" (2025). Master's Projects. 1541.
DOI: https://doi.org/10.31979/etd.v7sz-5v3s
https://scholarworks.sjsu.edu/etd_projects/1541