Publication Date

Spring 2023

Degree Type

Master's Project

Degree Name

Master of Science (MS)

Department

Computer Science

First Advisor

Ching-Seh Wu

Second Advisor

Chris Pollett

Third Advisor

Robert Chun

Keywords

Keyword Extraction, Natural Language Processing, Information Retrieval, Similarity Score, Wikipedia Web Tables

Abstract

Information retrieval and data interpretation on the web, for the purpose of gaining insight, have been widely researched since the early days of the World Wide Web. Web tables are structured tabular data embedded in the otherwise unstructured, heterogeneous data on the web. This makes web tables a rich source of information for tasks such as data analysis, data interpretation, and information retrieval. Wikipedia tables, a subset of web tables, hold a large amount of useful data that, if explored and mined appropriately, can be a rich source of information for knowledge extraction. This research applies natural language processing to information retrieval and data interpretation on web tables, specifically Wikipedia tables, to perform keyword extraction, search, and ranking.

The goal of the project is to create an index from Wikipedia table data and page titles, to search a Wikipedia corpus for pages that match the input terms, and to rank the results by how frequently the terms occur in each result. In a nutshell, the system is a keyword-based search and ranking system. For this purpose, the popular traditional NLP models RAKE, TextRank, TF-IDF, and spaCy are used. The results are ranked from most to least relevant. Precision, recall, and F1 are then used to evaluate the performance of the models.
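As a rough illustration of the kind of system described above, the following minimal sketch builds an inverted index over page titles and table text, ranks matching pages by summed term frequency, and computes precision, recall, and F1 for one query. It is not the project's implementation: the function names, the `title` and `table_text` fields, the tokenizer, and the three toy pages are all assumptions made for this example, and the keyword extraction step (RAKE, TextRank, TF-IDF, spaCy) is omitted.

```python
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(pages):
    """Map each term to {page_id: frequency} over title plus table text."""
    index = defaultdict(dict)
    for page_id, page in pages.items():
        counts = Counter(tokenize(page["title"] + " " + page["table_text"]))
        for term, freq in counts.items():
            index[term][page_id] = freq
    return index


def search(index, query):
    """Return page ids ordered by the summed frequency of the query terms."""
    scores = Counter()
    for term in tokenize(query):
        for page_id, freq in index.get(term, {}).items():
            scores[page_id] += freq
    # Highest score first; ties are broken by page id for determinism.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [page_id for page_id, _ in ranked]


def precision_recall_f1(retrieved, relevant):
    """Set-based precision, recall, and F1 for a single query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical toy corpus; real input would come from Wikipedia tables.
pages = {
    "A": {"title": "List of rivers",
          "table_text": "Nile river Africa Amazon river South America"},
    "B": {"title": "List of mountains",
          "table_text": "Everest Asia Denali North America"},
    "C": {"title": "Amazon basin",
          "table_text": "Amazon river tributaries Amazon rainforest"},
}

index = build_index(pages)
results = search(index, "amazon river")
print(results)
print(precision_recall_f1(results, relevant=["C"]))
```

Ranking by raw term frequency favors long pages; a weighting such as TF-IDF would be one way to offset that, though the abstract does not say how the project handles it.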

A notable technical contribution of this work is a set of strategies for improving keyword extraction by leveraging embedded web tables. The experimental findings show that RAKE and TF-IDF outperform TextRank and spaCy in some cases, while the opposite holds in others. Furthermore, regardless of the model or dataset considered, our model performance scores exceed those reported in other studies. The results give insight into the efficacy of traditional NLP models for tabular text interpretation, keyword-based search, indexing, and ranking of relevant results. The results of such an experiment also vary considerably with the dataset and the task at hand. Finally, the study highlights the importance of selecting the right combination of models, datasets, and tasks, as well as of correct preprocessing, to achieve high performance.

Available for download on Friday, May 24, 2024
