Master of Science (MS)
T. Y. Lin
Semantic Web Crawler, Stop words
A Semantic Search Engine (SSE) is a program that produces semantic-oriented concepts from the Internet. A web crawler is the front end of our SSE; its primary goal is to supply important and necessary information to the data analysis component of the SSE. The main function of the analysis component is to produce concepts (moderately frequent finite sequences of keywords) from the input; it uses variants of TF-IDF as its primary tool for removing stop words. However, filtering out stop words with TF-IDF is very expensive. The goal of this project is to improve the efficiency of the SSE by not feeding junk data (stop words) to it. In this project, we formally define three classes of stop words: English-grammar-based stop words, Metadata stop words, and Topic-specific stop words. To remove English-grammar-based stop words, we simply use a stop word list that can be found on the Internet. For Metadata stop words, we create a simple web crawler and add a modified HTML parser to it; the parser identifies and removes Metadata stop words. As a result, our web crawler removes most of the Metadata stop words and reduces the processing time of the SSE. However, little is known in advance about Topic-specific stop words. We therefore identify them from a randomly selected sample of documents, instead of identifying all keywords (at or above a threshold) and all stop words (below the threshold) on the whole set of documents. MapReduce is applied to reduce the complexity and to find Topic-specific stop words such as “acm” (Association for Computing Machinery), which we find in IEEE data mining papers. We then create a Topic-specific stop word list and use it to reduce the processing time of the SSE.
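The sampling idea for Topic-specific stop words can be illustrated with a minimal sketch. It is not the project's implementation: the function names, the whitespace tokenizer, and the document-frequency cutoff `min_doc_fraction` are all assumptions made for illustration, and the map and reduce steps run in a single process rather than on a MapReduce cluster. The sketch treats a word as a candidate stop word when it appears in most of the sampled documents, which is the situation in which a TF-IDF score would be low.

```python
import random
from collections import Counter
from functools import reduce


def map_doc(text):
    """Map step: count each distinct lowercase token in one document once."""
    return Counter(set(text.lower().split()))


def reduce_counts(left, right):
    """Reduce step: merge per-document counts into document frequencies."""
    return left + right


def topic_stop_words(docs, sample_size, min_doc_fraction=0.8, seed=0):
    """Return words found in at least min_doc_fraction of a random sample.

    min_doc_fraction is a hypothetical cutoff chosen for this sketch; the
    source does not state the threshold it uses.
    """
    rng = random.Random(seed)
    sample = rng.sample(docs, min(sample_size, len(docs)))
    doc_freq = reduce(reduce_counts, map(map_doc, sample), Counter())
    cutoff = min_doc_fraction * len(sample)
    return {word for word, count in doc_freq.items() if count >= cutoff}


# Toy corpus (invented for illustration): "acm" occurs in every document.
docs = [
    "acm data mining",
    "acm graph clustering",
    "acm text retrieval",
    "acm stream sampling",
]
stop_words = topic_stop_words(docs, sample_size=4)
```

On this toy corpus only "acm" reaches the cutoff, so it is the only word placed on the Topic-specific stop word list. Working on a sample keeps the counting cost proportional to the sample size rather than to the whole collection.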
Zhang, Shujia, "Intelligent Web Crawler for Semantic Search Engine" (2017). Master's Projects. 508.