Guides: Text Mining : Free Access Corpora

Finding Texts

Often one of the most time-consuming parts of text mining is getting access to good digital copies of the texts you are interested in so that you can use them in text mining tools. Ideally, you want .txt files of those texts. For a physical text, this is typically achieved through scanning and OCR (Optical Character Recognition). If the text is already digital, you may have less work, but digital texts often must still be reformatted and/or OCR-ed. Be sure to budget plenty of time in your project plan to collect and correctly format your texts before you begin text mining.
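As a rough illustration of what "correctly formatting" a text can involve, the Python sketch below tidies raw OCR output into plain text. The function name clean_ocr_text and the specific cleanup rules (rejoining words split across lines, unwrapping hard line breaks, collapsing extra spaces) are assumptions chosen for this example, not a standard procedure; real OCR output usually needs further manual checking.

```python
import re


def clean_ocr_text(raw: str) -> str:
    """Tidy OCR output into plain text (illustrative rules only)."""
    # Normalize Windows and old Mac line endings to "\n".
    text = raw.replace("\r\n", "\n").replace("\r", "\n")

    # Rejoin words hyphenated across a line break: "digi-\ntal" -> "digital".
    # Known limitation: a genuine compound split at its hyphen
    # ("well-\nknown") is also joined, so spot-check the results.
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)

    # Treat blank lines as paragraph breaks, then unwrap each paragraph
    # into a single line with single spaces between words.
    paragraphs = re.split(r"\n\s*\n", text)
    cleaned = [re.sub(r"\s+", " ", p).strip() for p in paragraphs]

    # Drop empty paragraphs and separate the rest with one blank line.
    return "\n\n".join(p for p in cleaned if p)


if __name__ == "__main__":
    sample = "Text min-\ning takes  time.\r\nPlan ahead.\n\n\nNew para-\ngraph."
    print(clean_ocr_text(sample))
```

The cleaned string can then be saved as a .txt file (for example with pathlib.Path.write_text) and loaded into the text mining tool of your choice.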

For academic research, text mining is permitted under the "fair use" principle, provided you do not attempt to pass off anyone else's work as your own, or recreate large parts of a text within your work. More information on text and data mining and fair use can be found in this paper from the Association of Research Libraries. Questions about fair use and copyright? Check out the Library's copyright guide or contact librarycopyright@georgetown.edu.

Depending on your area of interest, the texts you’re interested in may already have been collected, digitized, and correctly formatted, then made available to other scholars. Popular collections of such corpora are listed below:

Collections of Open Texts

Chronicling America
Chronicling America provides free access to millions of historic American newspaper pages. Coverage runs from 1777 to 1963.
HathiTrust Digital Library
The Georgetown University Library is a member of HathiTrust, a partnership of academic and research institutions offering a collection of millions of titles digitized from libraries around the world. Georgetown users can log in to HathiTrust's digital library to download millions of titles for free.
Project Gutenberg
Project Gutenberg provides free access to over 75,000 eBooks. The eBooks can be downloaded as plain text files, enabling easy text mining with a range of tools.
University of Oxford Text Archive
The University of Oxford Text Archive (OTA) is a repository of digital literary and linguistic resources for research and teaching. Users can download texts as XML files, plain text files, or view the text in Voyant with the direct link provided in each entry of the OTA catalog.
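Even ready-made plain text files may need light preparation before analysis. Plain text downloads from Project Gutenberg, for instance, typically wrap the book in a header and a license footer. The Python sketch below keeps only the text between the start and end marker lines. The function name strip_gutenberg_boilerplate is hypothetical, and the marker wording in the patterns is an assumption based on commonly seen files; it varies between files, so check a few downloads before relying on it.

```python
import re

# Assumed marker lines, e.g. "*** START OF THE PROJECT GUTENBERG EBOOK TITLE ***".
# The exact wording differs between files; adjust the patterns as needed.
START_RE = re.compile(
    r"^\*\*\* ?START OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*$", re.MULTILINE
)
END_RE = re.compile(
    r"^\*\*\* ?END OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*$", re.MULTILINE
)


def strip_gutenberg_boilerplate(text: str) -> str:
    """Return the text between the start and end marker lines.

    If a marker is missing, that side of the text is left untouched.
    """
    start = START_RE.search(text)
    end = END_RE.search(text)
    begin = start.end() if start else 0
    finish = end.start() if end and end.start() >= begin else len(text)
    return text[begin:finish].strip()


if __name__ == "__main__":
    sample = (
        "Header information\n"
        "*** START OF THE PROJECT GUTENBERG EBOOK SAMPLE ***\n\n"
        "The body of the book.\n\n"
        "*** END OF THE PROJECT GUTENBERG EBOOK SAMPLE ***\n"
        "License text\n"
    )
    print(strip_gutenberg_boilerplate(sample))
```

Removing this boilerplate first keeps license wording from skewing word counts and other frequency-based results.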