Publication Date

Summer 2025

Degree Type

Thesis

Degree Name

Master of Science (MS)

Department

Computer Engineering

Advisors

Carlos Rojas; Wendy Lee; Bernardo Flores

Abstract

Large Language Models (LLMs) have shown remarkable capabilities in interpreting complex patterns across various domains, yet their application to genomic data remains limited. We see great potential in leveraging LLMs for vital biological tasks, such as predicting transcription factor binding sites and identifying antibiotic resistance genes. These capabilities position LLMs as powerful tools for enhancing our understanding of the intricate language of the genome. LLMs trained specifically on genomic data, such as DNA sequences, operate differently from those trained on natural language. This difference is evident not only in the architectural landscape of the models but also in the methodologies tokenizers employ to handle nucleotide sequences. In this work, we develop and train LLMs tailored for genomic sequences, focusing on the E. coli organism as our initial case study. Our approach uses the widely adopted GPT-2-like model architecture, and we explore various tokenization strategies to quantify the sequence-to-compressed-size ratio. Furthermore, we present an evaluation of the E. coli models, demonstrating their intrinsic performance and showing how In-Context Learning can be applied to quantify how well a model understands the genomic language.

Keywords: Genomics, Large Language Models, Evaluation
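For readers unfamiliar with the compression metric mentioned above, the sketch below shows one way to compute a sequence-to-compressed-size ratio for a simple non-overlapping k-mer tokenizer. The k-mer scheme, the helper names, and the toy fragment are illustrative assumptions, not the exact tokenizers evaluated in the thesis.

```python
# Minimal sketch: measuring the sequence-to-compressed-size ratio
# (bases per token) for a simple DNA tokenization strategy. The
# kmer_tokenize and compression_ratio helpers are hypothetical names
# introduced here for illustration only.

def kmer_tokenize(sequence: str, k: int) -> list[str]:
    """Split a DNA sequence into non-overlapping k-mers."""
    return [sequence[i:i + k] for i in range(0, len(sequence), k)]

def compression_ratio(sequence: str, tokens: list[str]) -> float:
    """Raw bases divided by token count; higher means stronger compression."""
    return len(sequence) / len(tokens)

if __name__ == "__main__":
    dna = "ATGGCGTACGTTAGCCGTAACGGATCCTAG"  # toy 30-base fragment
    for k in (1, 3, 6):
        tokens = kmer_tokenize(dna, k)
        print(f"k={k}: {len(tokens)} tokens, "
              f"ratio={compression_ratio(dna, tokens):.2f} bases/token")
```

Under this scheme the ratio is simply k (e.g., 5.00 bases/token at k=6 for the 30-base fragment); learned subword tokenizers such as BPE would yield variable-length tokens and a data-dependent ratio.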
