Publication Date

Spring 2025

Degree Type

Master's Project

Degree Name

Master of Science in Computer Science (MSCS)

Department

Computer Science

First Advisor

William Andreopoulos

Second Advisor

Thomas Austin

Third Advisor

Sayma Akther

Keywords

Biomedical NLP, Named Entity Recognition, Relation Extraction, CHEMPROT, EU-ADR, BioBERT, BioGPT, bert-base-cased, Knowledge Graph, Multi-hop Reasoning, Natural Language Processing

Abstract

The rapid growth of biomedical research has produced an overwhelming volume of unstructured text in the scientific literature, which makes automated approaches to knowledge extraction and integration necessary. In this project, we present a comprehensive pipeline for constructing a unified biomedical knowledge graph by combining two well-known datasets: CHEMPROT [1], which captures chemical–protein interactions, and EU-ADR [2], which annotates drug–gene–disease relationships. To identify important biomedical entities and interactions in the CHEMPROT dataset, we perform Named Entity Recognition (NER) and Relation Extraction (RE) using state-of-the-art language models such as BioBERT [3], BioGPT [4], and bert-base-cased [5]. The NER step identifies 647 unique chemicals and 790 gene/protein mentions, while the RE step produces 10,558 relation triples. We then create a heterogeneous knowledge graph linking chemicals, genes/proteins, and diseases; the final graph consists of 3,194 nodes and 7,716 edges. Finally, we apply multi-hop reasoning (3-hop and 4-hop) to infer new relationships, helping to uncover biomedical insights that are not explicitly stated in the literature [6]. Our results show how multi-hop inference enhances the knowledge graph beyond direct annotations.
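
The abstract does not say how the multi-hop reasoning step is implemented. As a rough illustration only, the sketch below enumerates 3-hop and 4-hop paths over a tiny directed graph built from relation triples. The entity names, relation labels, and the helper functions `build_graph` and `k_hop_paths` are all hypothetical; none of them come from CHEMPROT, EU-ADR, or the project's code.

```python
from collections import defaultdict

# Hypothetical toy triples (head, relation, tail); not real dataset content.
TRIPLES = [
    ("chem_A", "inhibits", "gene_X"),
    ("gene_X", "associated_with", "disease_D"),
    ("chem_A", "binds", "gene_Y"),
    ("gene_Y", "regulates", "gene_X"),
    ("disease_D", "targeted_by", "chem_B"),
]


def build_graph(triples):
    """Map each head entity to its outgoing (relation, tail) pairs."""
    graph = defaultdict(list)
    for head, rel, tail in triples:
        graph[head].append((rel, tail))
    return graph


def k_hop_paths(graph, start, hops):
    """Return every simple path of exactly `hops` edges that starts at `start`."""
    paths = []

    def walk(node, path, visited):
        if len(path) == hops:
            paths.append(list(path))
            return
        for rel, nxt in graph.get(node, []):
            if nxt in visited:  # skip cycles so each path visits a node once
                continue
            path.append((node, rel, nxt))
            visited.add(nxt)
            walk(nxt, path, visited)
            visited.remove(nxt)
            path.pop()

    walk(start, [], {start})
    return paths


graph = build_graph(TRIPLES)
three_hop = k_hop_paths(graph, "chem_A", 3)
four_hop = k_hop_paths(graph, "chem_A", 4)

for path in three_hop + four_hop:
    # Print each path as a chain of (head, relation, tail) steps.
    print(" -> ".join(f"{h} [{r}] {t}" for h, r, t in path))
```

In a sketch like this, the first and last entity of each path form a candidate indirect relationship (for example, a chemical linked to a disease through intermediate genes). Any such candidate would still need filtering or scoring before it could be treated as inferred knowledge.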

Available for download on Monday, June 15, 2026
