Publication Date

Fall 2015

Degree Type

Master's Project

Degree Name

Master of Science (MS)

Department

Computer Science

Abstract

Document similarity search is closely related to the nearest neighbour problem and has applications in various domains. To determine the similarity or dissimilarity of documents, they first need to be converted into sets of shingles. Each document is converted into k-shingles, k being the length of each shingle. The similarity is calculated using the Jaccard distance between sets and output into a characteristic matrix. The cost of parsing this matrix is high, especially when the sets are large. In this project we explore various approaches, such as Min-Hashing, Locality Sensitive Hashing (LSH), and Bloom filters, to decrease the matrix size and improve the time complexity. Min-Hashing creates a signature matrix which is significantly smaller than the characteristic matrix. We look into the implementation of Min-Hashing and its pros and cons. We also explore Locality Sensitive Hashing and Bloom filters and their advantages.
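
The pipeline outlined in the abstract (shingling, Jaccard comparison, Min-Hashing) can be illustrated with a minimal Python sketch. It is not taken from the project itself. The function names, the use of character shingles, the CRC32 shingle ids, and the linear hash family modulo a Mersenne prime are all assumptions made for illustration.

```python
import random
import zlib

# Assumption: a Mersenne prime used as the modulus of the hash family.
PRIME = (1 << 61) - 1


def shingles(text, k):
    """Return the set of character k-shingles of text."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity |a & b| / |a | b|; the Jaccard distance is 1 minus this."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def make_hash_params(n, seed=0):
    """Draw n (a, b) pairs for hash functions h(x) = (a * x + b) mod PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME)) for _ in range(n)]


def minhash_signature(shingle_set, params):
    """Return one minimum hash value per hash function.

    Assumes shingle_set is non-empty. CRC32 maps each shingle to an
    integer id that stays the same across runs (unlike the built-in hash()).
    """
    ids = [zlib.crc32(s.encode("utf-8")) for s in shingle_set]
    return [min((a * x + b) % PRIME for x in ids) for a, b in params]


def estimate_similarity(sig_a, sig_b):
    """Fraction of signature positions that agree, an estimate of Jaccard similarity."""
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)


if __name__ == "__main__":
    doc_a = shingles("the quick brown fox jumps over the lazy dog", 4)
    doc_b = shingles("the quick brown fox jumped over a lazy dog", 4)
    params = make_hash_params(200)
    sig_a = minhash_signature(doc_a, params)
    sig_b = minhash_signature(doc_b, params)
    print("exact Jaccard similarity:", jaccard(doc_a, doc_b))
    print("Min-Hash estimate:", estimate_similarity(sig_a, sig_b))
```

Each signature has a fixed length (200 values here) regardless of how many shingles a document contains, which is why a signature matrix can be much smaller than a characteristic matrix. The estimate approaches the exact Jaccard similarity as the number of hash functions grows.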

Recommended Citation

Nagireddy, Siddartha Reddy, "Scalable Techniques for Similarity Search" (2015). Master's Projects. 438.
DOI: https://doi.org/10.31979/etd.w9ed-wqnd
https://scholarworks.sjsu.edu/etd_projects/438

Included in

Computer Sciences Commons
