Publication Date
Summer 2025
Degree Type
Thesis
Degree Name
Master of Science (MS)
Department
Computer Engineering
Advisor
Carlos Rojas; Wendy Lee; Bernardo Flores
Abstract
Large Language Models (LLMs) have shown remarkable capabilities in interpreting complex patterns across various domains, yet their application to genomic data remains limited. We see great potential in leveraging LLMs for vital biological tasks, such as predicting transcription factor binding sites and identifying antibiotic resistance genes. Their emergent capabilities position LLMs as powerful tools for enhancing our understanding of intricate biological language. LLMs trained specifically on genomic data, such as DNA sequences, operate differently from those trained on natural language. This difference is evident not only in the model architectures but also in how tokenizers handle nucleotide sequences. In this work, we develop and train LLMs tailored for genomic sequences, focusing on E. coli as our initial case study. Our approach uses a GPT-2-like model architecture, with which we explore various tokenization strategies and quantify the sequence-to-compressed size ratio each one achieves. Furthermore, we present an evaluation of the E. coli models that demonstrates their intrinsic performance and shows how In-Context Learning can be applied to quantify how well a model understands the genomic language.
Keywords: Genomics, Large Language Models, Evaluation
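As a rough illustration of the sequence-to-compressed size ratio mentioned in the abstract, the sketch below compares three simple tokenization schemes on a short made-up DNA string. The abstract does not say which tokenizers the thesis uses or exactly how the ratio is defined, so the function names, the toy sequence, and the definition used here (nucleotides in the input divided by tokens produced) are all assumptions made for illustration.

```python
# Illustrative sketch only: these tokenizers and this ratio definition are
# assumptions, not the method used in the thesis.

def char_tokens(seq: str) -> list[str]:
    """One token per nucleotide."""
    return list(seq)


def kmer_tokens(seq: str, k: int) -> list[str]:
    """Non-overlapping k-mers; the final token may be shorter than k."""
    return [seq[i:i + k] for i in range(0, len(seq), k)]


def overlapping_kmer_tokens(seq: str, k: int) -> list[str]:
    """Sliding-window k-mers with stride 1."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def compression_ratio(seq: str, tokens: list[str]) -> float:
    """Nucleotides per token (assumed definition of the ratio)."""
    return len(seq) / len(tokens)


if __name__ == "__main__":
    seq = "ATGCGTACGTTAGC"  # hypothetical toy sequence, not real E. coli data
    schemes = {
        "character": char_tokens(seq),
        "3-mer (non-overlapping)": kmer_tokens(seq, 3),
        "3-mer (overlapping)": overlapping_kmer_tokens(seq, 3),
    }
    for name, tokens in schemes.items():
        print(f"{name}: {len(tokens)} tokens, "
              f"ratio {compression_ratio(seq, tokens):.2f}")
```

Under this definition a higher ratio means each token covers more nucleotides, so a fixed context window spans a longer stretch of the genome. A learned subword tokenizer such as byte-pair encoding would be measured the same way.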
Recommended Citation
Kapoor, Aadit, "Understanding and Evaluating Genomic Language Models" (2025). Master's Theses. 5685.
DOI: https://doi.org/10.31979/etd.ku9u-4cne
https://scholarworks.sjsu.edu/etd_theses/5685