Publication Date

Summer 2025

Degree Type

Thesis

Degree Name

Master of Science (MS)

Department

Computer Engineering

Advisors

Carlos Rojas; Wendy Lee; Bernardo Flores

Abstract

Large Language Models (LLMs) have shown remarkable capabilities in interpreting complex patterns across various domains, yet their application to genomic data remains limited. We see great potential in leveraging LLMs for vital biological tasks, such as predicting transcription factor binding sites and identifying antibiotic resistance genes. These capabilities position LLMs as powerful tools for enhancing our understanding of the intricate language of the genome. LLMs trained specifically on genomic data, such as DNA sequences, operate differently from those trained on natural language. This difference is evident not only in the architectural landscape of the models but also in the methodologies tokenizers employ to handle nucleotide sequences. In this work, we develop and train LLMs tailored for genomic sequences, focusing on the E. coli organism as our initial case study. Our approach uses the widely adopted GPT-2-like model architecture, and we explore various tokenization strategies to quantify the sequence-to-compressed-size ratio. Furthermore, we present an evaluation of the E. coli models, demonstrating their intrinsic performance and showing how In-Context Learning can be applied to quantify how well a model understands the genomic language.

Keywords: Genomics, Large Language Models, Evaluation
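For readers unfamiliar with the compression metric mentioned above, the sketch below shows one way to compute a sequence-to-compressed-size ratio for a simple non-overlapping k-mer tokenizer. The k-mer scheme, the helper names, and the toy fragment are illustrative assumptions, not the exact tokenizers evaluated in the thesis.

```python
# Minimal sketch: measuring the sequence-to-compressed-size ratio
# (bases per token) for a simple DNA tokenization strategy. The
# kmer_tokenize and compression_ratio helpers are hypothetical names
# introduced here for illustration only.

def kmer_tokenize(sequence: str, k: int) -> list[str]:
    """Split a DNA sequence into non-overlapping k-mers."""
    return [sequence[i:i + k] for i in range(0, len(sequence), k)]

def compression_ratio(sequence: str, tokens: list[str]) -> float:
    """Raw bases divided by token count; higher means stronger compression."""
    return len(sequence) / len(tokens)

if __name__ == "__main__":
    dna = "ATGGCGTACGTTAGCCGTAACGGATCCTAG"  # toy 30-base fragment
    for k in (1, 3, 6):
        tokens = kmer_tokenize(dna, k)
        print(f"k={k}: {len(tokens)} tokens, "
              f"ratio={compression_ratio(dna, tokens):.2f} bases/token")
```

Under this scheme the ratio is simply k (e.g., 5.00 bases/token at k=6 for the 30-base fragment); learned subword tokenizers such as BPE would yield variable-length tokens and a data-dependent ratio.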
