Text Mining

Finding Texts

Finding good digital copies of the texts you want to analyze is often one of the most time-consuming parts of text mining. Ideally, you want plain-text (.txt) files of the texts you are interested in. For a physical text, this is typically achieved through scanning and OCR (Optical Character Recognition). If the text is already digital, you may have less work, but digital texts often still need to be reformatted and/or OCR-ed. Be sure to budget plenty of time in your project plan to collect and correctly format your texts before you begin text mining.
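As an illustration of the kind of reformatting this step can involve, here is a minimal Python sketch that tidies raw OCR output into plain text. The function name clean_ocr_text and the specific cleanup rules are assumptions chosen for the example, not a standard procedure; real projects usually need rules tailored to their own sources.

```python
import re
import unicodedata


def clean_ocr_text(raw: str) -> str:
    """Tidy raw OCR (or copy-pasted) text into plain text for mining.

    Hypothetical helper: the rules below are illustrative assumptions.
    """
    # Normalize Unicode so ligatures such as "ﬁ" become plain letters.
    text = unicodedata.normalize("NFKC", raw)
    # Rejoin words hyphenated across a line break: "digi-\ntal" -> "digital".
    # Caveat: this also joins true compounds ("well-\nknown" -> "wellknown").
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Treat blank lines as paragraph breaks.
    paragraphs = re.split(r"\n\s*\n", text)
    # Within each paragraph, collapse line breaks and repeated spaces.
    cleaned = [re.sub(r"\s+", " ", p).strip() for p in paragraphs]
    # Drop empty paragraphs and separate the rest with one blank line.
    return "\n\n".join(p for p in cleaned if p)
```

The cleaned string can then be saved as a .txt file, for example with pathlib.Path("mytext.txt").write_text(cleaned, encoding="utf-8"), where mytext.txt is a placeholder file name. It is worth spot-checking the output against the page images, since automatic cleanup can introduce errors of its own.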

Depending on your area of interest, the texts you need may already have been collected, digitized, correctly formatted, and made available to other scholars. Popular collections of such corpora are listed below:

Collections of Open Texts

This work is licensed under a Creative Commons Attribution NonCommercial 4.0 International License.