Guides: Text Mining: Web Scraping a Corpus

What is web scraping?

Web scraping is defined as "the extraction of desired information from web pages for processing or any use" (IGI Global InfoScipedia). Web scraping can be automated with code or done manually by a researcher.

Web scraping is one technique that can be used to create a corpus of textual data for analysis. When scraping the web, there are a number of different methods that can be used, and a number of ethical and legal considerations. See below for information on ethics and methods.

Important Factors to Consider in Web Scraping

Note: This page is for informational purposes and does not constitute legal advice. Researchers who seek to do web scraping are responsible for adhering to local laws around copyright, data collection, and privacy, as well as to the Terms and Conditions of any website they seek to scrape.

When web scraping, there are three main ethical principles to consider:

1) Don't break the web

When web scraping, particularly when writing a custom web scraper, you must be mindful of how many requests you send to a particular website, especially when accessing a large number of pages. A web scraper can overburden a web server and cause the website to slow significantly or crash, limiting the access other "normal" users have to the website. Hackers deliberately flood servers in this way for nefarious purposes, in what is called a "Denial of Service Attack". Even if your web scraper is legitimate and not being used for hacking, an excessive volume of queries can get you banned from a website, either because your scraping is flagged as suspicious or because it places a significant burden on the servers.

Scrapy, a popular Python package for building scrapers, has settings that reduce these risks, such as a configurable download delay and automatic throttling. Additionally, you can slow or delay your queries, or run your scraper during off-peak hours.
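The delay approach can be sketched in a few lines of Python using only the standard library. This is a minimal illustration, not a recommended configuration: the two-second delay, the User-Agent string, and the function names fetch and polite_crawl are all assumptions made for the example, and an appropriate delay depends on the site you are querying.

```python
import time
from typing import Callable, Dict, Iterable
from urllib.request import Request, urlopen

# Hypothetical values for illustration; adjust them to the site's stated limits.
DELAY_SECONDS = 2.0
USER_AGENT = "research-scraper (contact: you@example.edu)"


def fetch(url: str) -> str:
    """Download one page, identifying the scraper in the User-Agent header."""
    request = Request(url, headers={"User-Agent": USER_AGENT})
    with urlopen(request, timeout=30) as response:
        return response.read().decode("utf-8", errors="replace")


def polite_crawl(
    urls: Iterable[str],
    fetch_page: Callable[[str], str] = fetch,
    delay: float = DELAY_SECONDS,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, str]:
    """Fetch each URL in turn, pausing `delay` seconds between requests."""
    pages: Dict[str, str] = {}
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay)  # wait before every request except the first
        pages[url] = fetch_page(url)
    return pages
```

Calling polite_crawl with a list of page addresses returns a dictionary mapping each address to its downloaded text. The fetch_page and sleep parameters exist only so the loop can be tried out without touching a real website.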

2) Don't steal

In web scraping, you must always consider the copyright and privacy of any information you are gathering. For academic research, many cases of scraping and mining copyrighted information are permitted under the "fair use" principle, provided you do not attempt to pass off anyone else's work as your own, or recreate large parts of a text within your work. More information on text and data mining and fair use can be found in this paper from the Association of Research Libraries. Questions about fair use and copyright? Check out the Library's copyright guide or contact librarycopyright@georgetown.edu.

If you are attempting to mine resources through the Library, or any other website (particularly any site that requires payment, login, and/or a subscription), you are bound to the terms of use for that site, which may prohibit scraping and/or mining.

Additional care is needed when mining private information. In general, anything on the web that is only accessible behind a password authentication system is not considered publicly available. This includes social media sites (which can also contain large amounts of Personally Identifiable Information). Be cautious and consider the ethics and legality if you seek to scrape private information or Personally Identifiable Information, as the terms and conditions of these sites, and local legislation, vary widely. It is better to be safe than sorry: many of these laws and policies are currently undergoing large changes, so always seek out current information. The Association of Internet Researchers has additional detailed guidance for ethics and privacy on their website.

Finally, members of the Georgetown community are bound to the university Computer Systems Acceptable Use Policy, which outlines principles of respect and responsibility when using computing and information technology tools at the university.

3) Be nice

Before undertaking a web scraping project, you can always contact the owner of the site to ask whether they can make the relevant data available, or to clarify the rules they have for the site. It can't hurt to ask! Depending on the scale of your project, and whether your research interests align with the website owner's, there may be the option to get the data directly from the owner(s) in an accessible, structured format from the back end of the website. Data sharing and collaboration can be mutually beneficial for all parties involved.

Source: UCSB Carpentry, Ethics & Legality of Web Scraping

Web Scraping Tools

There are many methods one can use to scrape information from the web. Generally these are divided into three categories, with varying levels of required technical expertise:

1) Manual/Browser-assisted (no-code)

At their core, automated web scraping methods have the same mechanics as going to a website and copy-pasting out the information you wish to analyze (just at a large scale, with quick and automated methods). Thus, you can always create a smaller corpus manually by navigating the publicly available website and copy-pasting relevant information. Some browser extensions can help expedite this process, such as Webscraper.io and Instant Data Scraper.

2) Querying APIs (low-code)

An API (Application Programming Interface) is an interface that lets applications share data with each other. Many large websites and social media applications have APIs that expedite sharing and downloading data in a structured format, all with the explicit permission of the website owner. You can often find out whether a website has an API by navigating to the "developer" section of the website. Querying an API can be done in Python or R, sometimes with packages specially designed to parse the data structure specific to that API. Thus, using APIs requires some coding experience, but not the expertise required to build a custom scraper. Many APIs require registration with the website to get your unique API key. APIs are typically run on a fee-based structure; however, some do have free tiers for hobbyists and academic researchers.
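As a rough sketch of what querying an API looks like in Python, the example below builds a request that carries an API key and then pulls a text field out of the structured (JSON) response. Everything specific here is hypothetical: the address api.example.com, the parameter names q and limit, the "results" and "text" fields, and the Bearer-style key header are assumptions for illustration. Each real API documents its own endpoints, parameters, and authentication scheme.

```python
import json
from typing import List
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Hypothetical endpoint and key; real values come from the API's documentation.
API_URL = "https://api.example.com/v1/posts"
API_KEY = "YOUR-API-KEY"


def build_request(query: str, limit: int = 25) -> Request:
    """Build a GET request with query parameters and the API key in a header."""
    params = urlencode({"q": query, "limit": limit})
    return Request(
        f"{API_URL}?{params}",
        headers={
            "Authorization": f"Bearer {API_KEY}",
            "Accept": "application/json",
        },
    )


def extract_texts(payload: str) -> List[str]:
    """Pull the text field out of each record in a JSON response body."""
    records = json.loads(payload).get("results", [])
    return [record["text"] for record in records if "text" in record]


def search(query: str) -> List[str]:
    """Send the request and return the text of every matching record."""
    with urlopen(build_request(query), timeout=30) as response:
        return extract_texts(response.read().decode("utf-8"))
```

Because the response is already structured, building the corpus is mostly a matter of selecting the fields you need, with no need to locate text inside a web page's layout.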

Note that many social media APIs are currently undergoing restructuring changes, so make sure you are looking at the most up-to-date documentation for the API. Additional information on the current landscape for APIs, data, and academic research can be found here.

3) Building a custom scraper (coding expertise required)

If a website doesn't have an API, or if you cannot acquire your data through copy-pasting or asking the website owner, then you could consider building a scraper. Scrapers can be built with R or Python. See below for some resources and common packages for building web scrapers and analyzing text data:

Text Mining with R (rvest, quanteda, koRpus, spacyr, tidytext, tokenizers, text2vec, lda, stm, topicmodels, SentimentAnalysis, cleanNLP)

Applied Text Analysis with Python (Scrapy, BeautifulSoup, NLTK, Scattertext, SpaCy, TextBlob)

In addition to expertise in R or Python, building a scraper also requires some knowledge of the languages the web page is built in (i.e., HTML/CSS, JavaScript) in order to tell the web scraper where to look for information on the web page.
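To illustrate what "telling the scraper where to look" means, here is a small sketch that uses Python's built-in html.parser module to keep only the paragraphs carrying a particular CSS class. The sample page, the class name article-text, and the ParagraphExtractor class are invented for this example. Packages such as BeautifulSoup and Scrapy offer more convenient ways to do the same selection.

```python
from html.parser import HTMLParser


class ParagraphExtractor(HTMLParser):
    """Collect the text of every <p> element that has a given CSS class."""

    def __init__(self, css_class: str) -> None:
        super().__init__()
        self.css_class = css_class
        self.inside = False   # True while we are inside a matching <p>
        self.buffer = []      # text fragments of the current paragraph
        self.paragraphs = []  # finished paragraphs, in page order

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "p" and self.css_class in classes:
            self.inside = True
            self.buffer = []

    def handle_endtag(self, tag):
        if tag == "p" and self.inside:
            self.paragraphs.append("".join(self.buffer).strip())
            self.inside = False

    def handle_data(self, data):
        if self.inside:
            self.buffer.append(data)


# A made-up page: navigation text we do not want, article text we do.
SAMPLE_HTML = """
<html><body>
  <p class="nav">Home | About</p>
  <p class="article-text">First paragraph of the <em>article</em>.</p>
  <p class="article-text">Second paragraph.</p>
</body></html>
"""

parser = ParagraphExtractor("article-text")
parser.feed(SAMPLE_HTML)
corpus = parser.paragraphs
```

Finding the right tag and class names usually means inspecting the page's HTML in your browser's developer tools before writing any code.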

With all these methods, you must adhere to local laws and website Terms and Conditions.

Practice Sites

Want to practice scraping in a simple, low-stakes environment? These sites are great sandboxes to practice: