Text mining, also known as text data mining (TDM), is the process of extracting information from a collection of texts. The types of texts used and the types of information sought vary widely across projects. Examples include tracking an author's word usage across their entire body of work, measuring how much two government documents have in common, and finding the most commonly used words in the transcript of a large company's earnings call.
Some examples of text analysis are:
Source: Adapted from materials by Nathan Kelber for ITHAKA
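Two of the examples mentioned above, finding the most commonly used words in a text and measuring the commonality between two documents, can be sketched with Python's standard library alone. This is a minimal illustration, not a recommended workflow: the sample texts are made up, the function names are hypothetical, and Jaccard similarity (shared distinct words divided by all distinct words) is only one of many ways to measure overlap.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens (letters and apostrophes)."""
    return re.findall(r"[a-z']+", text.lower())


def most_common_words(text, n=3):
    """Return the n most frequent words in the text, with their counts."""
    return Counter(tokenize(text)).most_common(n)


def jaccard_similarity(text_a, text_b):
    """Share of distinct words that two texts have in common (0.0 to 1.0)."""
    words_a, words_b = set(tokenize(text_a)), set(tokenize(text_b))
    if not words_a and not words_b:
        return 0.0
    return len(words_a & words_b) / len(words_a | words_b)


# Made-up sample texts, for illustration only.
transcript = "Revenue grew this quarter. Revenue growth was driven by new revenue streams."
doc_a = "The committee approved the budget"
doc_b = "The committee rejected the budget"

print(most_common_words(transcript, 1))   # [('revenue', 3)]
print(jaccard_similarity(doc_a, doc_b))   # 0.6
```

A real project would usually also remove very common words (such as "the" or "this") and normalize word forms before counting, which dedicated text mining tools handle for you.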
This guide highlights some popular, easy-to-use tools for text mining, as well as some popular text corpora.
In general, text mining requires these steps to analyze texts:
Steps 1-4 describe the methods you may take in assembling a corpus. A corpus (pl. corpora) is defined as "all text used on an empirical selected case study, to be further processed by methods of linguistic analysis" (IGI Global InfoScipedia).
Machine learning is a branch of artificial intelligence (AI) and computer science that can be used to build text analysis models that mimic the way humans dissect text. Many of the tools in this LibGuide can also be used to build a machine learning model, or to apply an existing one, to categorize words (e.g., sentiment analysis) and extract meaning from words (e.g., named entity recognition). The AI models used in this kind of machine learning differ from generative large language models (LLMs) in that they are more specialized and targeted at specific tasks. Any output from an AI or ML model should be reviewed for accuracy, and a researcher building an AI or ML model should be prepared to test and refine the model to ensure it answers their research question.
This article from Sage Research Methods gives a comprehensive foundational overview of machine learning for beginners.
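To make the idea of building and then testing a specialized model more concrete, here is a minimal sketch of a sentiment classifier. It assumes a Naive Bayes approach with add-one smoothing, which is just one common technique and not necessarily what the tools in this guide use. The class name, the training sentences, and the labels are all made up for illustration; a real project would rely on an established library and far more data.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


class NaiveBayesClassifier:
    """Hypothetical minimal multinomial Naive Bayes with add-one smoothing."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.doc_counts = Counter()              # label -> number of training texts
        self.vocabulary = set()

    def train(self, labeled_texts):
        for text, label in labeled_texts:
            self.doc_counts[label] += 1
            for word in tokenize(text):
                self.word_counts[label][word] += 1
                self.vocabulary.add(word)

    def predict(self, text):
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label in self.doc_counts:
            # Start from the log prior, then add a log likelihood per known word.
            score = math.log(self.doc_counts[label] / total_docs)
            total_words = sum(self.word_counts[label].values())
            for word in tokenize(text):
                if word not in self.vocabulary:
                    continue  # skip words never seen in training
                count = self.word_counts[label][word] + 1
                score += math.log(count / (total_words + len(self.vocabulary)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Made-up training and held-out examples, for illustration only.
training = [
    ("great helpful clear guide", "positive"),
    ("I love this wonderful tool", "positive"),
    ("terrible confusing broken tool", "negative"),
    ("I hate this awful guide", "negative"),
]
held_out = [
    ("wonderful clear guide", "positive"),
    ("awful broken tool", "negative"),
]

classifier = NaiveBayesClassifier()
classifier.train(training)

# Review the output: compare predictions against labels the model has not seen.
correct = sum(classifier.predict(text) == label for text, label in held_out)
print(f"Accuracy on held-out examples: {correct}/{len(held_out)}")
```

Checking predictions against held-out examples, as in the last lines, is the simplest form of the review step described above. If accuracy is poor, the researcher would revisit the training data or the model before trusting its output.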
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.