Darshan Karia

Publication Date

Spring 2012

Degree Type

Master's Project


Computer Science


It is useful to create personalized web crawls, and search through them later on to see the archived content and compare it with current content to see the difference and evolution of that portion of web. It is also useful for searching through the portion of web you are interested in an offline mode without need of going online. To accomplish that, in this project I focus towards indexing of the archive (ARC) files generated by an open source web-crawler named Heritrix. I developed a Java module to perform indexing on these archive files. I used large set of archive files crawled by Heritrix and tested indexing performance of the module. I also benchmarked performance for my indexer and compare these results with various other indexers. The index alone is not of much use until we can use it to search through archives and get search results. To accomplish that, I developed a JSP module using an interface for reading archive files to provide search results. As a whole, when combined with Heritrix, this project can be used to perform personalized crawls, store archive of the crawl, index the archives, and search through those archives.