Publication Date

Fall 2015

Degree Type

Master's Project


Computer Science


In the Internet era, data is growing rapidly in both size and dimension. Much of this data consists of web pages that represent human thoughts. These thoughts involve concepts and associations, which we can capture. Using mathematics, we can perform meaningful clustering of these pages. This project aims to provide a new problem-solving paradigm for data science based on algebraic topology. Professor Vasant Dhar of NYU, Editor-in-Chief of Big Data, defines data science as the generalizable extraction of knowledge from data. The core concept of the semantic-based search engine project developed by my team is to extract high-frequency finite sequences of keywords by association mining. Each frequent finite keyword sequence represents a human concept in a document set. The collective view of such a collection of concepts represents a piece of human knowledge, so this MS project is a data science project. By regarding each keyword as an abstract vertex, a finite sequence of keywords becomes a simplex, and the collection becomes a simplicial complex. Based on this geometric view, a new type of clustering can be performed. If two concepts are connected by an n-simplex, we say that these two simplices are connected. The connected components will be captured by the homology theory of simplicial complexes. The input data for this project are ten thousand files about data mining, downloaded from the IEEE Xplore library. Search engines today deal with large amounts of high-dimensional data, so applying mathematical concepts and measuring connectivity for ten thousand files will be a real challenge. Because using algebraic topology here is a completely new approach, extensive testing has to be performed to verify the homology groups obtained.
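
The idea of treating keywords as vertices and frequent keyword groups as simplices can be illustrated with a small sketch. This is not the project's implementation: the function names `frequent_keyword_sets` and `connected_components`, the toy documents, and the support threshold are all hypothetical. It also makes two simplifying assumptions: keyword groups are treated as unordered sets rather than sequences, and two simplices count as connected whenever they share a vertex, which gives only the zero-dimensional connectivity (the number of components equals the rank of the homology group H_0).

```python
from collections import Counter
from itertools import combinations


def frequent_keyword_sets(docs, min_support, max_size=3):
    """Return keyword sets (sorted tuples) found in at least min_support documents.

    Brute-force counting, adequate only for toy inputs.
    """
    counts = Counter()
    for doc in docs:
        words = sorted(set(doc))  # ignore order and repeats within a document
        for size in range(1, max_size + 1):
            for combo in combinations(words, size):
                counts[combo] += 1
    return [s for s, c in counts.items() if c >= min_support]


def connected_components(simplices):
    """Group vertices into connected components with union-find.

    Every simplex links all of its vertices into one component.
    """
    parent = {}

    def find(v):
        parent.setdefault(v, v)
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    for simplex in simplices:
        root = find(simplex[0])
        for v in simplex[1:]:
            other = find(v)
            if other != root:
                parent[other] = root

    groups = {}
    for v in list(parent):
        groups.setdefault(find(v), set()).add(v)
    return sorted(sorted(g) for g in groups.values())


# Hypothetical toy corpus: each document is a list of extracted keywords.
docs = [
    ["data", "mining", "association"],
    ["data", "mining", "clustering"],
    ["simplicial", "complex", "homology"],
    ["simplicial", "complex"],
    ["data", "mining"],
]

simplices = frequent_keyword_sets(docs, min_support=2)
components = connected_components(simplices)
print(components)  # [['complex', 'simplicial'], ['data', 'mining']]
```

Here the two frequent pairs form two separate 1-simplices, so the toy complex has two connected components, each of which can be read as one cluster of related concepts. Computing higher homology groups would require boundary matrices or a dedicated topology library, which this sketch does not attempt.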