
Text Mining

What is Text Mining?

Text mining, sometimes called text data mining (TDM), is the process of extracting information from a collection of texts. The kinds of texts used and the kinds of information sought vary widely across projects. Examples include tracking an author's word usage across their entire body of work, measuring how much two government documents have in common, or finding the most frequently used words in the transcript of a large company's earnings call.
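As a rough illustration of the last example, the sketch below counts the most frequent words in a short passage using only the Python standard library. The function name `most_common_words` and the sample sentence are made up for this illustration, and the word-splitting rule (lowercase runs of letters and apostrophes) is a simplifying assumption; dedicated text mining tools handle punctuation, stop words, and word forms more carefully.

```python
import re
from collections import Counter


def most_common_words(text, n=5):
    """Return the n most frequent words in text, ignoring case."""
    # Lowercase the text and treat runs of letters/apostrophes as words.
    # This is a deliberately simple tokenizer, not a linguistic one.
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words).most_common(n)


# Invented sample text standing in for an earnings call transcript.
sample = "Revenue grew this quarter. Revenue growth was driven by cloud revenue."
print(most_common_words(sample, 1))  # [('revenue', 3)]
```

In a real project you would probably also filter out very common words such as "the" and "of" before counting, since they tend to dominate the results.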

This guide highlights some popular, easy-to-use tools for text mining, as well as some popular corpora of texts.

Steps for Text Mining

In general, text mining requires these steps to analyze texts:

  1. Find the texts you want to work with
  2. Ensure the texts are machine readable; scan and OCR them if needed
  3. Convert texts to appropriate file formats (.txt if possible)
  4. Compile texts into one folder on your computer
  5. Load texts into your tool of choice
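If your tool of choice is a programming language rather than a packaged application, steps 4 and 5 might look something like the following Python sketch. It assumes your texts are already saved as UTF-8 .txt files in a single folder; the folder name `my_corpus` and the function name `load_corpus` are hypothetical placeholders, not part of any particular tool.

```python
from pathlib import Path


def load_corpus(folder):
    """Read every .txt file in folder into a {filename: text} dictionary."""
    corpus = {}
    # sorted() gives a predictable file order across operating systems.
    for path in sorted(Path(folder).glob("*.txt")):
        # errors="replace" substitutes a marker character for any bytes
        # that are not valid UTF-8 instead of stopping with an error.
        corpus[path.name] = path.read_text(encoding="utf-8", errors="replace")
    return corpus


if __name__ == "__main__":
    # "my_corpus" is a placeholder; point this at your own folder.
    corpus = load_corpus("my_corpus")
    for name, text in corpus.items():
        print(f"{name}: {len(text.split())} words")
```

Keeping the file name alongside each text makes it easier to trace any later result back to the document it came from.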

Steps 1-4 describe how you might assemble a corpus. A corpus (pl. corpora) is defined as "all text used on an empirical selected case study, to be further processed by methods of linguistic analysis" (IGI Global InfoScipedia).

This work is licensed under a Creative Commons Attribution NonCommercial 4.0 International License.