Publication Date
Spring 2025
Degree Type
Master's Project
Degree Name
Master of Science in Computer Science (MSCS)
Department
Computer Science
First Advisor
William Andreopoulos
Second Advisor
Wendy Lee
Third Advisor
Anurag Wasankar
Keywords
Insecticidal Genes, Bacterial Genomics, Functional Gene Prediction, Deep Learning, Transformer Models, Biological Language Models, K-mer Tokenization
Abstract
Identifying bacterial gene sequences with agricultural applications has the potential to transform agricultural biotechnology. These genes can be used in environmentally friendly pest control strategies. One such use case is identifying genes with potential insecticidal properties. As the volume of genomic data grows and the number of available annotated sequences declines, finding new insecticidal genes has become more challenging. Traditional methods that rely on sequence alignment and annotated databases are not effective at detecting functionally relevant genes that lack close homology to known cases. This project investigates data-driven gene classification through sequence modeling. The research focuses on learning DNA sequence motifs and transferring them to distinguish between insecticidal and non-insecticidal genes. The study shows that functional information useful for decision-making can be obtained from DNA with state-of-the-art machine learning methods, and that deep models can generalize to low-resource settings.
Recommended Citation
Chavan, Manvendra, "Large Language Models for Bacterial Genomic Analysis" (2025). Master's Projects. 1568.
DOI: https://doi.org/10.31979/etd.573m-zvbd
https://scholarworks.sjsu.edu/etd_projects/1568