Publication Date
Spring 2025
Degree Type
Master's Project
Degree Name
Master of Science in Computer Science (MSCS)
Department
Computer Science
First Advisor
William Andreopoulos
Second Advisor
Thomas Austin
Third Advisor
Sayma Akther
Keywords
Biomedical NLP, Named Entity Recognition, Relation Extraction, CHEMPROT, EU-ADR, BioBERT, BioGPT, bert-base-cased, Knowledge Graph, Multi-hop Reasoning, Natural Language Processing
Abstract
The rapid growth of biomedical research has produced an overwhelming volume of unstructured text in the scientific literature, which calls for automated approaches to knowledge extraction and integration. In this project, we present a comprehensive pipeline for constructing a unified biomedical knowledge graph by combining two well-known datasets: CHEMPROT [1], which captures chemical–protein interactions, and EU-ADR [2], which annotates drug–gene–disease relationships. To identify important biomedical entities and interactions in the CHEMPROT dataset, we perform Named Entity Recognition (NER) and Relation Extraction (RE) using state-of-the-art models such as BioBERT [3], BioGPT [4], and bert-base-cased [5]. The NER step identifies 647 unique chemicals and 790 gene/protein mentions, while the RE step produces 10,558 relation triples. We then create a heterogeneous knowledge graph linking chemicals, genes/proteins, and diseases. The final graph consists of 3,194 nodes and 7,716 edges. Finally, we apply multi-hop reasoning (3-hop and 4-hop) over the graph to infer new relationships, helping to uncover biomedical insights that are not explicitly stated in the literature [6]. Our results show how multi-hop inference enhances the knowledge graph beyond direct annotations.
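As a rough illustration of the multi-hop step described in the abstract, the sketch below enumerates fixed-length paths over a small graph built from relation triples. It is not the project's implementation: the function names (`build_graph`, `multi_hop_paths`), the entity names, and the relation labels are all hypothetical placeholders, and the toy triples are invented for the example rather than taken from CHEMPROT or EU-ADR.

```python
from collections import defaultdict


def build_graph(triples):
    """Build an adjacency map from (head, relation, tail) triples."""
    graph = defaultdict(list)
    for head, relation, tail in triples:
        graph[head].append((relation, tail))
    return graph


def multi_hop_paths(graph, start, hops):
    """Return every simple path of exactly `hops` edges that begins at `start`."""
    paths = []

    def walk(node, path, visited):
        if len(path) == hops:
            paths.append(path)
            return
        for relation, neighbor in graph.get(node, []):
            # Skip nodes already on the path so that paths stay simple (no cycles).
            if neighbor not in visited:
                walk(neighbor, path + [(node, relation, neighbor)], visited | {neighbor})

    walk(start, [], {start})
    return paths


# Toy triples with placeholder names (assumptions, not real annotations).
triples = [
    ("chemical_A", "inhibits", "gene_B"),
    ("gene_B", "associated_with", "disease_C"),
    ("disease_C", "treated_by", "drug_D"),
    ("gene_B", "interacts_with", "gene_E"),
]

graph = build_graph(triples)
three_hop = multi_hop_paths(graph, "chemical_A", 3)

for path in three_hop:
    print(" -> ".join(f"{h} [{r}] {t}" for h, r, t in path))
```

In this sketch, the endpoints of each returned path (here `chemical_A` and `drug_D`) form a candidate indirect relationship that no single triple states directly. How such candidates would be scored or filtered is left open.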
Recommended Citation
Krishna, Akshat, "Advanced Knowledge Extraction with Biomedical Data Using LLMs" (2025). Master's Projects. 1574.
DOI: https://doi.org/10.31979/etd.nxc3-jtd8
https://scholarworks.sjsu.edu/etd_projects/1574