Publication Date
Fall 2015
Degree Type
Master's Project
Degree Name
Master of Science (MS)
Department
Computer Science
Abstract
Document similarity search is closely related to the nearest neighbour problem and has applications in various domains. To determine how similar or dissimilar two documents are, each document is first converted into a set of k-shingles, where k is the length of each shingle. Similarity is calculated from the Jaccard distance between the sets, which are represented as a characteristic matrix; processing this matrix is expensive, especially when the sets are large. In this project we explore several approaches, namely Min-Hashing, Locality Sensitive Hashing (LSH) and Bloom Filters, to decrease the matrix size and improve the time complexity. Min-Hashing creates a signature matrix which is significantly smaller than the characteristic matrix. We look into the implementation of Min-Hashing and its pros and cons, and we also explore Locality Sensitive Hashing and Bloom Filters and their advantages.
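The pipeline described in the abstract (shingling, Jaccard comparison, Min-Hashing) can be illustrated with a minimal Python sketch. It is not taken from the project itself: the function names, the character-level shingles, the use of CRC32 to map shingles to integers, and the linear hash family (a*x + b) mod p are all assumptions chosen for illustration.

```python
import random
import zlib

# Assumed modulus for the illustrative hash family: the Mersenne prime 2^61 - 1.
PRIME = (1 << 61) - 1


def shingles(text, k):
    """Return the set of all k-character shingles in text."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def jaccard_similarity(a, b):
    """|a & b| / |a | b|; the Jaccard distance is 1 minus this value."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def make_hash_params(n, seed=0):
    """Draw n (a, b) pairs for hash functions h(x) = (a*x + b) mod PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME)) for _ in range(n)]


def minhash_signature(shingle_set, hash_params):
    """One minimum per hash function; expects a non-empty shingle set."""
    # CRC32 gives a stable integer id per shingle (an assumption of this sketch).
    ids = [zlib.crc32(s.encode("utf-8")) for s in shingle_set]
    return [min((a * x + b) % PRIME for x in ids) for a, b in hash_params]


def estimate_similarity(sig_a, sig_b):
    """Fraction of signature positions that agree, an estimate of Jaccard similarity."""
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)


if __name__ == "__main__":
    params = make_hash_params(200)
    doc_a = shingles("the quick brown fox jumps over the lazy dog", 5)
    doc_b = shingles("the quick brown fox jumped over a lazy dog", 5)
    print("exact Jaccard similarity:", jaccard_similarity(doc_a, doc_b))
    print("MinHash estimate:", estimate_similarity(
        minhash_signature(doc_a, params), minhash_signature(doc_b, params)))
```

Each signature has a fixed length (200 values here) regardless of how many shingles the document contains, which is why a signature matrix can be much smaller than the characteristic matrix. The estimate only approximates the exact Jaccard similarity, and it tends to get closer as the number of hash functions grows.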
Recommended Citation
Nagireddy, Siddartha Reddy, "Scalable Techniques for Similarity Search" (2015). Master's Projects. 438.
DOI: https://doi.org/10.31979/etd.w9ed-wqnd
https://scholarworks.sjsu.edu/etd_projects/438