How to Collect a Corpus of Websites With a Web Crawler

Publication Date

July 12, 2022

Document Type

Contribution to a Book

Publication Title

SAGE Research Methods: Doing Research Online

Editor

Morgan Currie

DOI

10.4135/9781529609325

Abstract

Conducting research on digital cultures often requires some form of reference to online sources, but online sources change constantly: pages are updated or deleted on a minute-by-minute basis. This guide introduces web crawlers as one method for gathering a stable, trustworthy collection of online sources. A corpus generated with a web crawler can serve as a detailed snapshot of an online resource as it existed at a particular point in time. The guide begins with an introduction to the theory behind web crawling, then discusses ethical concerns and commonly used tools. It concludes with a step-by-step demonstration of web crawling using Wget, a popular open-source command-line download tool.
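As a preview of the kind of invocation the guide walks through, a minimal Wget crawl might look like the sketch below. The target URL is a hypothetical placeholder, and the flags shown are standard Wget options for producing a polite, self-contained local mirror; the chapter's own demonstration may differ.

    # Mirror a site into a local, browsable snapshot (hypothetical URL).
    # --mirror            recursive download with timestamping
    # --convert-links     rewrite links so the copy works offline
    # --page-requisites   fetch the images, CSS, and scripts each page needs
    # --adjust-extension  save pages with file extensions matching their type
    # --wait=2 --random-wait  pause between requests to reduce server load
    wget --mirror --convert-links --page-requisites --adjust-extension \
         --wait=2 --random-wait https://example.org/

The wait options matter for research crawling: spacing out requests keeps the crawl from burdening the host server, one of the ethical concerns the guide addresses.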

Keywords

crawling, search engines, software, websites

Department

Information
