Publication Date

Spring 2012

Degree Type

Master's Project

Degree Name

Master of Science (MS)

Department

Computer Science

Abstract

It is useful to create personalized web crawls and to search them later, both to view the archived content and to compare it with current content to see how that portion of the web has changed and evolved. It is also useful for searching the portion of the web you are interested in offline, without needing to go online. To accomplish that, this project focuses on indexing the archive (ARC) files generated by an open-source web crawler named Heritrix. I developed a Java module to index these archive files. I used a large set of archive files crawled by Heritrix to test the indexing performance of the module. I also benchmarked my indexer and compared the results with those of various other indexers. The index alone is not of much use until we can use it to search the archives and get search results. To that end, I developed a JSP module that uses an interface for reading archive files to provide search results. As a whole, when combined with Heritrix, this project can be used to perform personalized crawls, store an archive of each crawl, index the archives, and search through them.

Recommended Citation

Karia, Darshan, "Full-Text Indexing for Heritrix" (2012). Master's Projects. 241.
DOI: https://doi.org/10.31979/etd.54vq-ux2u
https://scholarworks.sjsu.edu/etd_projects/241

Included in

Computer Sciences Commons
