Master of Science (MS)
Internet search has become an essential part of almost everyone’s daily life and work. To make wise personal and business decisions in a timely fashion, one must access the most relevant information efficiently. Because the amount of information on the Internet is enormous, a search engine must rank information appropriately when it presents search results to users. Latent Semantic Indexing (LSI) addresses relevance ranking based on how significant a search word is in each document. Several innovative approaches to computing higher-dimensional LSI (HD-LSI) were explored in this project. In traditional LSI, the term frequency-inverse document frequency (TFIDF) is calculated based on how significant a single word is in a document. The goal of this project is to generalize LSI to higher dimensions, treating traditional LSI as the one-dimensional special case. One benefit of the project is that a search engine can rank documents based on the special meaning of multi-word phrases, such as “wall street,” which is captured by a two-dimensional LSI method. Another benefit is a set of reusable Java software components that compute HD-LSI and store the indexes in a relational database, from which many types of applications can access the HD-LSI data. These components may be reused in future research to study the semantic proximity of documents in high-dimensional space. Beyond the software engineering aspect, this project contributes to computer science by studying the different approaches to HD-LSI computation. In particular, the dimensional trends in each case were analyzed.
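To make the contrast between single-word and phrase-level scoring concrete, the following is a minimal sketch in Java. It is not the project's HD-LSI implementation: it assumes a plain TF-IDF weighting (relative term frequency multiplied by the natural log of the inverse document frequency) applied to runs of n adjacent words, with n = 1 for single words and n = 2 for two-word phrases such as "wall street". The class name `NGramTfIdf`, its methods, the naive tokenizer, and the four-document corpus are all hypothetical and chosen only for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration only: plain TF-IDF over n-word terms,
// not the HD-LSI computation described in the abstract.
final class NGramTfIdf {

    /** Splits text into lowercase alphanumeric tokens (a deliberately naive tokenizer). */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    /** Returns every run of n adjacent tokens, joined by a single space. */
    static List<String> nGrams(List<String> tokens, int n) {
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + n <= tokens.size(); i++) {
            grams.add(String.join(" ", tokens.subList(i, i + n)));
        }
        return grams;
    }

    /**
     * TF-IDF of an n-word term in the document at docIndex.
     * TF is the term's share of the document's n-grams; IDF is
     * ln(number of documents / number of documents containing the term).
     */
    static double tfIdf(String term, int docIndex, List<String> corpus) {
        List<String> termTokens = tokenize(term);
        int n = termTokens.size();
        if (n == 0) {
            return 0.0; // an empty term carries no weight
        }
        String normalized = String.join(" ", termTokens);
        int docsWithTerm = 0;
        double tf = 0.0;
        for (int d = 0; d < corpus.size(); d++) {
            List<String> grams = nGrams(tokenize(corpus.get(d)), n);
            int count = Collections.frequency(grams, normalized);
            if (count > 0) {
                docsWithTerm++;
            }
            if (d == docIndex && !grams.isEmpty()) {
                tf = (double) count / grams.size();
            }
        }
        if (docsWithTerm == 0) {
            return 0.0; // term never occurs in the corpus
        }
        return tf * Math.log((double) corpus.size() / docsWithTerm);
    }

    public static void main(String[] args) {
        // Made-up corpus: document 1 contains "wall" and "street" but not the phrase.
        List<String> corpus = Arrays.asList(
            "Wall Street banks reported record profits",
            "The street ran along the old stone wall",
            "Traders on Wall Street watched the market",
            "Bond yields fell after the announcement");
        for (int d = 0; d < corpus.size(); d++) {
            System.out.printf("doc %d: wall=%.4f  \"wall street\"=%.4f%n",
                d, tfIdf("wall", d, corpus), tfIdf("wall street", d, corpus));
        }
    }
}
```

In this sketch the second document receives a positive score for the single word "wall" but a score of zero for the phrase "wall street", because the two words never appear next to each other in it. That is the kind of distinction a phrase-aware index can draw and a single-word index cannot.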
Vo, Mong-Hang, "Automatic Extraction of Keywords and Co-occurrence Keyword Sets" (2006). Master's Projects. Paper 25.