Publication Date
2006
Degree Type
Master's Project
Degree Name
Master of Science (MS)
Department
Computer Science
Abstract
This thesis evaluates the effectiveness of a combinatorial topology structure, the simplicial complex, for document clustering. The hypothesis is that a simplicial complex identifies the latent concept space defined by a collection of documents better than hypergraphs or human categorization do. The complex is constructed from groups of co-occurring words (term associations) identified with traditional data mining methods. Disjoint subsections of the complex (connected components) represent general concepts within the documents’ concept space, so documents clustered to these connected components are expected to form meaningful groupings. In this work, however, the most specific concepts (maximal simplices) are used as representatives of the connected components to demonstrate the technique’s effectiveness. Each document in a cluster is compared against its human-assigned category to determine the cluster’s precision. The results show that this technique clusters documents better than human classifiers.
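The pipeline described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's implementation. It assumes the term associations have already been mined and are given as plain sets of words, that a document joins a cluster when it contains every term of a maximal simplex, and that a cluster's precision is the share of its documents carrying the cluster's most common human label. All function names and the toy data are hypothetical.

```python
from collections import Counter, defaultdict


def maximal_simplices(associations):
    """Keep only term sets that are not a proper subset of another term set."""
    sets = {frozenset(a) for a in associations}
    return [s for s in sets if not any(s < t for t in sets)]


def connected_components(simplices):
    """Merge non-empty simplices that share a term, directly or transitively."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for s in simplices:
        terms = sorted(s)
        for t in terms[1:]:
            parent[find(t)] = find(terms[0])  # union all terms of one simplex

    groups = defaultdict(set)
    for s in simplices:
        groups[find(min(s))] |= s
    return sorted(groups.values(), key=sorted)


def cluster_documents(docs, simplices):
    """Assumed rule: a document joins each simplex whose terms it fully contains."""
    clusters = defaultdict(list)
    for doc_id, words in docs.items():
        for s in simplices:
            if s <= words:
                clusters[s].append(doc_id)
    return dict(clusters)


def precision(cluster, labels):
    """Share of documents in the cluster that carry its most common human label."""
    counts = Counter(labels[d] for d in cluster)
    return counts.most_common(1)[0][1] / len(cluster)


# Toy data (hypothetical): term associations, documents as word sets, human labels.
associations = [
    {"wall", "street"},
    {"wall", "street", "stock"},
    {"stock"},
    {"neural", "network"},
    {"neural"},
]
docs = {
    "d1": {"wall", "street", "stock", "falls"},
    "d2": {"wall", "street", "stock", "rally"},
    "d3": {"neural", "network", "training"},
    "d4": {"wall", "street", "stock", "neural"},
}
labels = {"d1": "finance", "d2": "finance", "d3": "computing", "d4": "computing"}

simplices = maximal_simplices(associations)
components = connected_components(simplices)
clusters = cluster_documents(docs, simplices)
scores = {s: precision(members, labels) for s, members in clusters.items()}
```

In this toy data the two maximal simplices happen to coincide with the two connected components. In a larger complex, one component would typically contain several maximal simplices, which is the distinction the abstract draws between general and specific concepts.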
Recommended Citation
Lind, Kevin, "Concept Based Document Clustering using a Simplicial Complex, a Hypergraph" (2006). Master's Projects. 22.
DOI: https://doi.org/10.31979/etd.w78q-45as
https://scholarworks.sjsu.edu/etd_projects/22