Publication Date
Spring 2012
Degree Type
Master's Project
Degree Name
Master of Science (MS)
Department
Computer Science
Abstract
It is useful to create personalized web crawls and search through them later, both to view the archived content and to compare it with current content to see how that portion of the web has changed and evolved. It is also useful to be able to search the portion of the web you are interested in offline, without needing to go online. To that end, this project focuses on indexing the archive (ARC) files generated by Heritrix, an open-source web crawler. I developed a Java module to index these archive files. I tested the module's indexing performance on a large set of archive files crawled by Heritrix, benchmarked my indexer, and compared the results with those of various other indexers. The index alone is of little use unless it can be used to search through the archives and return results. To accomplish that, I developed a JSP module that uses an interface for reading archive files to provide search results. As a whole, when combined with Heritrix, this project can be used to perform personalized crawls, store an archive of each crawl, index the archives, and search through them.
Recommended Citation
Karia, Darshan, "Full-Text Indexing for Heritrix" (2012). Master's Projects. 241.
DOI: https://doi.org/10.31979/etd.54vq-ux2u
https://scholarworks.sjsu.edu/etd_projects/241