Charles J. Bocage, San Jose State University


Automatic text summarization is the task of capturing the key ideas of a text passage in as few words as possible. With the increase in data on the web, manual summarization of web pages has become infeasible, and the need for automatic text summarization has grown accordingly. This project explored and implemented various parts of the automatic text summarization process for an open-source search engine, Yioop. These parts included stemming, text segmentation, term frequency weighting, automatic sentence compression, and content management system detection.

In addition, experiments were conducted on Yioop’s pre-existing summarizers. These results served as a baseline for comparison with results obtained from two new summarization methods that we implemented: a graph-based approach and an average sentence approach. Summaries were evaluated using Recall-Oriented Understudy for Gisting Evaluation (ROUGE). Analysis of the ROUGE results for each summarizer showed that the new summarizers did not produce better summaries than Yioop’s pre-existing summarizers. During these experiments, it was noted that the location of useful information on a web page could often be identified if one could determine the content management system that created the page. An extensible content management system detector was therefore written for the Yioop search engine, and ROUGE results were recomputed for the various summarizers with the detector in use. Using the content management system detector resulted in a ten to twenty percent increase in ROUGE scores across various page experiments.
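To make the evaluation metric concrete, the following is a minimal sketch of ROUGE-N recall: the fraction of n-grams in a human-written reference summary that also appear in the generated summary. It is an illustration under simplifying assumptions, not the scoring code used in the project. The function names `ngrams` and `rouge_n_recall` are hypothetical, tokenization is plain lowercased whitespace splitting, and no stemming or stop-word removal is applied.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count every contiguous n-gram in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_recall(candidate, reference, n=1):
    """Return the share of reference n-grams that the candidate also contains.

    Assumption: whitespace tokenization on lowercased text. Matches are
    clipped, so an n-gram counts at most as often as it occurs in the
    candidate summary.
    """
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    total = sum(ref.values())
    if total == 0:
        return 0.0  # reference too short to contain any n-gram
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    return overlap / total


reference = "the cat sat on the mat"
candidate = "the cat lay on the mat"
print(rouge_n_recall(candidate, reference, n=1))
print(rouge_n_recall(candidate, reference, n=2))
```

In this toy example, five of the six reference unigrams and three of the five reference bigrams are matched, so a summary that shares more wording with the reference scores higher.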