
Text Mining

What is Text Mining?

Text mining, also known as text data mining (TDM), is the process of extracting information from a collection of texts. The types of texts used and the kinds of information sought vary widely across projects. Examples include tracking an author's word usage across their entire body of work, measuring how much two government documents have in common, or finding the most frequently used words in the transcript of a large company's earnings call.

Common types of text analysis include:

  • Word frequencies - counting key words across your corpus
  • Concordance - examining a key word in context
  • Collocation - analyzing words that commonly appear in close proximity
  • Sentiment analysis - determining the emotional tone of texts
  • Named entity recognition - identifying named entities (such as people, places, and organizations) and classifying them into categories
  • Question answering - using a natural language processing (NLP) model to answer questions about your texts
  • Significant terms - identifying terms that are unusually frequent in one section of your corpus compared with the rest
  • Summarization - creating an abstract of a document

Source: Adapted from materials by Nathan Kelber for ITHAKA
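
To make the first three methods in the list concrete, here is a minimal Python sketch using only the standard library. It is an illustration, not the approach of any particular tool in this guide: the sample sentence and the function names (tokenize, word_frequencies, concordance, collocations) are assumptions made up for this example, and counting adjacent word pairs is only a simple stand-in for a full collocation measure.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def word_frequencies(tokens, top_n=3):
    """Return the top_n most common tokens with their counts."""
    return Counter(tokens).most_common(top_n)

def concordance(tokens, keyword, window=2):
    """Return each occurrence of keyword with `window` tokens of context per side."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append(f"{left} [{tok}] {right}".strip())
    return lines

def collocations(tokens, top_n=3):
    """Count adjacent word pairs (bigrams), a simple proxy for collocation."""
    return Counter(zip(tokens, tokens[1:])).most_common(top_n)

# Hypothetical sample text; in practice this would come from your corpus.
sample = "The cat sat on the mat. The cat saw the dog."
tokens = tokenize(sample)

print(word_frequencies(tokens))
print(concordance(tokens, "cat"))
print(collocations(tokens))
```

For real projects you would typically remove very common words (such as "the") before counting, since they tend to dominate raw frequency lists.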

This guide highlights some popular, easy-to-use tools for text mining, as well as some popular corpora of texts.

Steps for Text Mining

In general, text mining requires these steps to analyze texts:

  1. Find the texts you want to work with
  2. Ensure texts are machine readable; scan them and run optical character recognition (OCR) if needed
  3. Convert texts to appropriate file formats (.txt if possible)
  4. Compile texts into one folder on your computer
  5. Load texts into your tool of choice

Steps 1-4 describe how to assemble a corpus. A corpus (pl. corpora) is defined as "all text used on an empirical selected case study, to be further processed by methods of linguistic analysis" (IGI Global InfoScipedia).
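
If your tool of choice is a programming language, steps 4 and 5 might look something like the Python sketch below. It assumes your texts are plain .txt files encoded as UTF-8 and sitting in a single folder; the function name load_corpus and the file names are hypothetical, and the temporary folder only stands in for your own corpus folder so that the example runs anywhere.

```python
import tempfile
from pathlib import Path

def load_corpus(folder):
    """Read every .txt file in `folder` into a dict of {filename: text}."""
    corpus = {}
    for path in sorted(Path(folder).glob("*.txt")):
        # errors="replace" keeps loading even if a file contains stray bytes,
        # which can happen with OCR output or mixed encodings.
        corpus[path.name] = path.read_text(encoding="utf-8", errors="replace")
    return corpus

# Demonstration with a temporary folder and made-up files.
with tempfile.TemporaryDirectory() as folder:
    Path(folder, "letter_01.txt").write_text("Dear friend,", encoding="utf-8")
    Path(folder, "letter_02.txt").write_text("Yours truly,", encoding="utf-8")
    Path(folder, "scan.pdf").write_bytes(b"not plain text")  # skipped: not .txt
    corpus = load_corpus(folder)

print(sorted(corpus))
```

Keeping the file name as the key makes it easy to trace any later result back to the document it came from.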

What is Machine Learning (ML)?

Machine learning is a branch of artificial intelligence (AI) and computer science that can be used to build text analysis models that mimic the way humans dissect text. Many of the tools in this LibGuide can be used to build a machine learning model, or to apply an existing one, to categorize words (e.g., sentiment analysis) and extract meaning from words (e.g., named entity recognition). These models differ from generative large language models (LLMs) in that they are more specialized and target specific tasks. Any output from an AI or ML model should be reviewed for accuracy, and a researcher building a model should be prepared to test and refine it to ensure it answers their research question.
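
One simple way to review a model's output for accuracy is to label a sample of texts by hand and compare those labels with what the model produced. The Python sketch below illustrates the idea under stated assumptions: the labels are made-up sentiment categories, the function name evaluate is hypothetical, and a real project would use a much larger hand-labeled sample.

```python
def evaluate(predicted, human_labels):
    """Compare model predictions with human labels for the same texts.

    Returns the share of matching labels and a list of
    (index, predicted, human) tuples for every disagreement.
    """
    if not human_labels or len(predicted) != len(human_labels):
        raise ValueError("need one prediction per human label, and at least one")
    disagreements = [
        (i, p, h)
        for i, (p, h) in enumerate(zip(predicted, human_labels))
        if p != h
    ]
    accuracy = 1 - len(disagreements) / len(human_labels)
    return accuracy, disagreements

# Made-up example: model output versus a researcher's own labels.
model_output = ["positive", "negative", "positive", "negative"]
hand_labels = ["positive", "negative", "negative", "negative"]

accuracy, disagreements = evaluate(model_output, hand_labels)
print(accuracy)
print(disagreements)
```

Reading through the disagreements, not just the accuracy figure, often shows which kinds of text the model handles poorly and where it needs refinement.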

This article from Sage Research Methods gives a comprehensive foundational overview of machine learning for beginners. 

 

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.