Publication Date

Spring 3-2-2016

Degree Type

Master's Project


Department

Computer Science

First Advisor

T. Y. Lin

Second Advisor

Robert Chun

Third Advisor

Eric Louie


Abstract

The amount of data on the internet grows every second, and the World Wide Web now contains billions of documents. Each document covers multiple concepts (a concept being an abstract or general idea inferred from specific instances).

In this paper, we present the design and implementation of an algorithm for extracting concepts from a set of documents. A search engine can use these concepts to generate results that better cater to the needs of the user, making them more targeted than those of a plain keyword search.

The main problem was to extract concepts from a set of documents. Each page can contain thousands of word combinations that are potential concepts, and an average document may yield millions of candidates. Combined with the vast amount of data on the web, this produces an enormous dataset. The main areas of concern are therefore main-memory constraints and the time complexity of the algorithm.

This paper introduces an algorithm that is scalable, not limited by main-memory size, and runs in linear time.
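The abstract does not spell out the algorithm itself. As a rough illustration only, the general idea of mining frequent word sequences as candidate concepts can be sketched as follows; the function name, whitespace tokenization, and document-frequency threshold here are assumptions for the sketch, not the paper's actual method:

```python
from collections import Counter

def extract_concepts(documents, max_len=3, min_df=2):
    """Collect n-grams (length 1..max_len) from each document and keep
    those that occur in at least min_df documents as candidate concepts.
    One pass over the tokens per document, so time is linear in input size."""
    df = Counter()  # document frequency of each n-gram
    for doc in documents:
        tokens = doc.lower().split()
        seen = set()  # count each n-gram at most once per document
        for n in range(1, max_len + 1):
            for i in range(len(tokens) - n + 1):
                seen.add(" ".join(tokens[i:i + n]))
        df.update(seen)
    return {gram for gram, count in df.items() if count >= min_df}

docs = [
    "concept extraction from web documents",
    "web documents contain many concepts",
]
concepts = extract_concepts(docs)
```

In this toy run, "web documents" survives because it appears in both documents, while phrases unique to a single document are discarded. The real algorithm described in the paper additionally addresses the main-memory constraint, which this in-memory sketch does not.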