Publication Date

Fall 2021

Degree Type

Master's Project

Degree Name

Master of Science (MS)

Department

Computer Science

First Advisor

Chris Pollett

Second Advisor

Robert Chun

Third Advisor

Akshay Kajale

Keywords

High Performance Document Store, Linear Hashing, Hash-Table, Buckets, WebArchive, Rust, JavaScript, Python, Data Storage, Data Retrieval

Abstract

Databases are a core part of any application that requires persistent data. The performance of such applications depends directly on how fast their database read and write operations are. The aim of this project was to build a high-performance document store that can support a variety of applications requiring data storage and retrieval. The document store can run as an independent backend service for search engines, record-keeping applications, and similar systems. We implemented it in Rust so that it is fast, robust, and memory efficient. The document store is a server that returns documents based on the key provided in a request. It can read WebArchive (warc extension) files as a feed to the linear-hashing-based datastore. The paper focuses on the implementation of this application and on why it is a suitable backend system for a search engine.

We tested the functionality offered by the application and documented the noteworthy results. The linear hash-table can insert 10,000 records with 16-byte key-value pairs in 10 seconds, whereas a similar implementation in JavaScript takes around 40 seconds for the same workload. The insertion time in our implementation increases logarithmically. The hash-table supports retrieval of 10,000 similarly sized records in under 5 seconds. The WebArchive parser utility parses 10,000 records of a compressed (gzip extension) warc file in an average of 70 seconds, approximately the same as the warcio library in Python. In our application, this time increases linearly with the number of records read. Including the conversion of warc records into a format suitable for the linear hash-table, the application takes an average of 80 seconds. The warc utility can also write warc files: it can write 10,000 warc files with a body of around 60 bytes in an average of 2 seconds. Detailed performance comparisons with other similar tools are also documented in the paper.
