Guides: Text Mining : Home

What is Text Mining?

Text mining, also known as text data mining (TDM), is the process of extracting information from a collection of texts. The types of texts used and the types of information sought vary widely across projects. Examples include tracking an author's word usage across their entire body of work, measuring how much two government documents have in common, or finding the most frequently used words in the transcript of a large company's earnings call.

Some examples of text analysis are:

Word frequencies - counting key words across your corpus
Concordance - examining a key word in context
Collocation - analyzing words that commonly appear in close proximity
Sentiment analysis - determining the emotional tone of texts
Named entity recognition - identifying named things (such as people, places, and organizations) and classifying them into categories
Question answering - using a natural language processing (NLP) model to answer questions about your texts
Significant terms - identifying terms with unusually high frequency in one section of your corpus compared with the rest
Summarization - creating an abstract of a document

Source: Adapted from materials by Nathan Kelber for ITHAKA
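
As a rough illustration of the first two techniques in the list above, here is a minimal Python sketch of word frequencies and concordance that uses only the standard library. The sample sentence and the function names (tokenize, word_frequencies, concordance) are invented for this example and do not come from any particular tool in this guide; real projects usually need more careful tokenization.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def word_frequencies(text, top_n=2):
    """Return the top_n most common words with their counts."""
    return Counter(tokenize(text)).most_common(top_n)

def concordance(text, keyword, window=2):
    """Return each occurrence of keyword with `window` words of context per side."""
    tokens = tokenize(text)
    hits = []
    for i, token in enumerate(tokens):
        if token == keyword.lower():
            left = tokens[max(0, i - window):i]
            right = tokens[i + 1:i + 1 + window]
            # Uppercase the key word so it stands out in its context
            hits.append(" ".join(left + [token.upper()] + right))
    return hits

# Invented sample text, used only for illustration
sample = "The whale surfaced. The crew watched the whale dive, and the whale vanished."

print(word_frequencies(sample))
for line in concordance(sample, "whale"):
    print(line)
```

The window parameter controls how many words of context appear on each side of the key word; a wider window gives more context at the cost of longer lines.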

This guide highlights some popular, easy-to-use tools for text mining, as well as some popular corpora of texts.

Steps for Text Mining

In general, text mining requires these steps to analyze texts:

1. Find the texts you want to work with
2. Ensure the texts are machine readable; scan and OCR them if needed
3. Convert the texts to an appropriate file format (.txt if possible)
4. Compile the texts into one folder on your computer
5. Load the texts into your tool of choice

Steps 1-4 describe how you might assemble a corpus. A corpus (pl. corpora) is defined as "all text used on an empirical selected case study, to be further processed by methods of linguistic analysis" (IGI Global InfoScipedia).
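
If your tool of choice is a programming language, the loading step might look like the Python sketch below. It assumes the corpus is a single folder of UTF-8 .txt files, as described in the steps above. The function name load_corpus and the sample file names are hypothetical, and the temporary folder exists only so the example runs on its own; in practice you would point the function at your own corpus folder.

```python
from pathlib import Path
import tempfile

def load_corpus(folder):
    """Read every .txt file in `folder` into a dict of {filename: text}."""
    corpus = {}
    for path in sorted(Path(folder).glob("*.txt")):
        # errors="replace" keeps loading even if OCR output contains odd bytes
        corpus[path.name] = path.read_text(encoding="utf-8", errors="replace")
    return corpus

# Demonstration with a throwaway folder and invented file names
with tempfile.TemporaryDirectory() as folder:
    Path(folder, "speech_1.txt").write_text("Four score and seven years ago", encoding="utf-8")
    Path(folder, "speech_2.txt").write_text("We choose to go to the moon", encoding="utf-8")
    Path(folder, "notes.pdf").write_bytes(b"%PDF")  # not a .txt file, so it is skipped
    corpus = load_corpus(folder)

for name, text in corpus.items():
    print(name, len(text.split()), "words")
```

Keeping the file name alongside each text makes it easier to trace a result back to the document it came from.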

What is Machine Learning (ML)?

Machine learning is a branch of artificial intelligence (AI) and computer science that can be used to build text analysis models that mimic the way humans interpret text. Many of the tools in this LibGuide can be used to build a machine learning model, or to apply an existing one, to categorize text (e.g., sentiment analysis) or extract meaning from it (e.g., named entity recognition). These task-specific models differ from generative large language models (LLMs) in that they are more specialized and built for a narrow purpose. Any output from an AI or ML model should be reviewed for accuracy, and a researcher building a model should be prepared to test and refine it to ensure it answers their research question.
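
To make the idea of building and then testing a task-specific model concrete, here is a toy Python sketch of a sentiment classifier. It uses multinomial Naive Bayes with add-one smoothing, a technique chosen here only for illustration and not necessarily what any tool in this guide uses. The training and held-out sentences are invented, the function names are hypothetical, and a real model would need far more data and a more careful evaluation.

```python
import math
from collections import Counter, defaultdict

def train(labeled_texts):
    """Count words per label for a multinomial Naive Bayes classifier."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    for text, label in labeled_texts:
        label_counts[label] += 1
        word_counts[label].update(text.lower().split())
    vocab = {word for counts in word_counts.values() for word in counts}
    return word_counts, label_counts, vocab

def predict(model, text):
    """Return the label with the highest log-probability for `text`."""
    word_counts, label_counts, vocab = model
    total_docs = sum(label_counts.values())
    scores = {}
    for label in label_counts:
        score = math.log(label_counts[label] / total_docs)  # prior for this label
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocab:
                continue  # ignore words never seen in training
            # Add-one smoothing avoids zero probabilities
            score += math.log((word_counts[label][word] + 1) / (total_words + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

# Invented examples, for illustration only
training = [
    ("a wonderful and moving story", "positive"),
    ("great acting and a great script", "positive"),
    ("a dull and boring plot", "negative"),
    ("terrible pacing and a boring script", "negative"),
]
held_out = [
    ("a wonderful script", "positive"),
    ("dull and terrible", "negative"),
]

model = train(training)
correct = sum(predict(model, text) == label for text, label in held_out)
accuracy = correct / len(held_out)
print("Held-out accuracy:", accuracy)
```

Checking predictions against examples the model did not see during training is one simple way to test a model before relying on its output.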

This article from Sage Research Methods gives a comprehensive foundational overview of machine learning for beginners.