Publication Date

Spring 2013

Degree Type

Master's Project

Department

Computer Science

Abstract

This project adds new cache-related features to Yioop, an open source, PHP-based search engine. Search engines often maintain caches of the pages downloaded by their crawlers, and commercial search engines such as Google display a link to the cached version of a web page alongside the search result for that page. The first feature lets users navigate Yioop's entire cache. When a cached page is displayed, links to earlier cached versions of that page are displayed with it, and users can browse the cache history by year and month. In this respect the feature resembles the Internet Archive, which maintains snapshots of the web taken at different times. Because the contents of a web page can change over time, a search engine that caches web pages has to make sure its cached copies are fresh. The second feature implements cache validation using information obtained from web page headers, specifically Entity Tags (the ETag header) and the Expires header. The cache validation feature is then tested for its effect on crawl speed and its savings in bandwidth.
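The header-based validation described above can be illustrated with a minimal sketch. It is not Yioop's code (Yioop is written in PHP) and only shows the general HTTP mechanism under stated assumptions: the function names `is_fresh`, `revalidation_headers`, and `plan_fetch` are hypothetical, and the cached headers are assumed to be stored as a plain dictionary. A crawler skips a page whose Expires time has not passed; otherwise it sends the stored ETag in an `If-None-Match` request header, so the server can answer `304 Not Modified` instead of resending the body.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def is_fresh(cached_headers, now):
    """Return True if the cached copy's Expires time is still in the future."""
    expires = cached_headers.get("Expires")
    if not expires:
        return False  # no Expires header: treat the copy as stale
    try:
        expires_at = parsedate_to_datetime(expires)
    except (TypeError, ValueError):
        return False  # unparseable date: treat the copy as already expired
    if expires_at.tzinfo is None:
        # Assumption: a date without zone information is interpreted as UTC.
        expires_at = expires_at.replace(tzinfo=timezone.utc)
    return now < expires_at


def revalidation_headers(cached_headers):
    """Build conditional request headers from a stored ETag, if there is one."""
    etag = cached_headers.get("ETag")
    return {"If-None-Match": etag} if etag else {}


def plan_fetch(cached_headers, now):
    """Decide how to re-crawl a page (hypothetical helper for illustration).

    Returns (action, request_headers), where action is one of:
      "skip"        - the cached copy has not expired, no request is needed
      "conditional" - send If-None-Match; a 304 reply means reuse the cache
      "full"        - nothing to validate against, download the page again
    """
    if is_fresh(cached_headers, now):
        return ("skip", {})
    headers = revalidation_headers(cached_headers)
    return ("conditional" if headers else "full", headers)
```

For example, calling `plan_fetch` with an expired Expires value and a stored ETag yields the `"conditional"` action together with the `If-None-Match` header to send. Both the skipped request and the 304 reply avoid transferring the page body, which is where bandwidth savings of the kind the abstract mentions would come from.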
