Often, one of the most time-consuming parts of text mining is obtaining good digital copies of the texts you are interested in so that you can use them in text mining tools. Ideally, you want plain-text (.txt) files of those texts. For a physical text, this typically means scanning it and running OCR (Optical Character Recognition). If the text is already digital, you may have less work, but digital texts often still need to be reformatted and/or OCR-ed. Be sure to budget plenty of time in your project plan to collect and correctly format your texts before you begin text mining.
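As a rough illustration of what "correctly formatting" a text can involve, the sketch below tidies raw OCR output before it is saved as a .txt file. It is only a minimal example using Python's standard library. The function name clean_ocr_text and the specific cleanup steps are assumptions made for illustration, not a prescribed workflow, and real OCR output usually needs additional, text-specific corrections.

```python
import re
import unicodedata


def clean_ocr_text(raw: str) -> str:
    """Tidy raw OCR output (hypothetical helper, for illustration only)."""
    # Normalize Unicode so ligatures such as "ﬁ" become plain "fi".
    text = unicodedata.normalize("NFKC", raw)
    # Rejoin words that were hyphenated across a line break.
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Turn single line breaks inside a paragraph into spaces,
    # keeping blank lines as paragraph breaks.
    text = re.sub(r"(?<!\n)\n(?!\n)", " ", text)
    # Collapse runs of spaces and tabs into one space.
    text = re.sub(r"[ \t]+", " ", text)
    return text.strip()


sample = "The digi-\ntal  text\nis here.\n\nNew para."
cleaned = clean_ocr_text(sample)
```

Note that the hyphen step would also merge genuine compounds that happen to break at the end of a line (for example, "well-" followed by "known"), and NFKC normalization changes other compatibility characters as well, so it is worth spot-checking the cleaned output against the original scan.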
For academic research, text mining is permitted under the "fair use" principle, provided you do not attempt to pass off anyone else's work as your own, or recreate large parts of a text within your work. More information on text and data mining and fair use can be found in this paper from the Association of Research Libraries. Questions about fair use and copyright? Check out the Library's copyright guide or contact librarycopyright@georgetown.edu.
Depending on your area of interest, the text you’re interested in may have already been collected, digitized, and formatted correctly and then made available to other scholars. Popular collections of corpora like these are listed below:
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.