Chak Li

Publication Date

Fall 2012

Degree Type

Master's Project


Computer Science


Abstract

Due to the need to organize the vast number of documents available on the Internet, automated extraction of semantic representations of webpages has become a popular research topic in both industry and academia. The purpose of this project is to introduce a new method that processes documents to extract their original contextual representations and then, in an automatic fashion, derives additional representations and connects similar ones based on the semantics underlying the extracted representations. Among the proposed steps, the core of this project is to tackle the difficulty of constructing a mechanism by which machines can computationally understand the lexical meaning of the extracted semantic representations. For instance, the word “good” has the same lexical meaning as the word “well”, so the two should be treated equally. Furthermore, the 2-gram “wall street” should be kept as-is instead of being tokenized into two single words, but “coffee or tea” should be tokenized into the two single words “coffee” and “tea”. Preserving rather than destroying the original semantics is important in text mining, so that documents can be further processed safely, efficiently, and accurately. In the project, I first discuss an adequate machine learning method, introduced by Professor Lin, that processes documents to extract the original contextual representations, namely primitive concepts. Then, I introduce new methods that use the extracted concepts, together with the WordNet database, to derive additional representations and connect similar ones based on the underlying semantics. In the last section of the report, I examine the proposed data processing method with sample data and justify the empirical results with data provided by Google Search. The project addresses the problems of reducing computation cost and improving prediction in contextual extraction for documents.
In general, most machine learning articles are well written and informative for general readers with a mathematics background, but not necessarily for readers with an engineering interest. In this report, an engineering mechanism is constructed with mathematical reasoning to persuade readers with a theoretical background. Readers from both the engineering and mathematical communities should come away with both an engineering and a theoretical understanding of the methods introduced in the project.
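
The tokenization examples given in the abstract (“good”/“well”, “wall street”, “coffee or tea”) can be illustrated with a minimal Python sketch. It is not the method developed in the project: the phrase list, connective list, and synonym map below are small hypothetical lookup tables, hard-coded for illustration, whereas the project extracts concepts from documents and relies on the WordNet database for lexical meaning.

```python
# Hypothetical lookup tables, hard-coded for illustration only.
# The project itself learns concepts from documents and uses WordNet.
KNOWN_PHRASES = {("wall", "street")}          # 2-grams to keep as one term
CONNECTIVES = {"or", "and"}                   # words that only join terms
SYNONYM_TO_CANONICAL = {"well": "good"}       # map synonyms to one form


def extract_terms(text):
    """Split text into terms, keeping known 2-grams and merging synonyms."""
    words = text.lower().split()
    terms = []
    i = 0
    while i < len(words):
        pair = tuple(words[i:i + 2])
        if pair in KNOWN_PHRASES:
            # Keep the 2-gram intact instead of splitting it.
            terms.append(" ".join(pair))
            i += 2
            continue
        word = words[i]
        if word not in CONNECTIVES:
            # Replace a synonym with its canonical form, if one is listed.
            terms.append(SYNONYM_TO_CANONICAL.get(word, word))
        i += 1
    return terms


print(extract_terms("wall street"))
print(extract_terms("coffee or tea"))
print(extract_terms("well"))
```

Under these assumptions, “wall street” stays a single term, “coffee or tea” yields two terms, and “well” is mapped to the same term as “good”.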