Text mining, also known as text data mining (TDM), is the process of extracting information from a collection of texts. The types of texts used and the types of information sought vary widely across projects. Examples include tracking an author's word usage across their entire body of work, measuring how much two government documents have in common, or finding the most frequently used words in the transcript of a large company's earnings call.
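As a rough illustration of the last example, the following Python sketch counts the most frequent words in a short passage. The function name most_common_words, the sample transcript, and the simple regular-expression tokenizer are all assumptions made for this illustration; a real project would likely also remove common stop words such as "the" and "by".

```python
import re
from collections import Counter


def most_common_words(text, n=3):
    """Return the n most frequent words in text as (word, count) pairs."""
    # Lowercase the text and treat runs of letters/apostrophes as words.
    # This is a deliberately crude tokenizer, used here only for illustration.
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words).most_common(n)


# Hypothetical stand-in for an earnings call transcript.
transcript = (
    "Revenue grew this quarter. Revenue growth was driven by services. "
    "Services revenue grew faster than hardware."
)

top_words = most_common_words(transcript)
print(top_words)
```

In this sample, "revenue" is the most frequent word, which is the kind of signal a word-frequency analysis is meant to surface.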
This guide highlights some popular, easy-to-use tools for text mining, as well as some popular corpora of texts.
In general, text mining requires these steps to analyze texts:
Steps 1-4 describe methods you may use to assemble a corpus. A corpus (pl. corpora) is defined as "all text used on an empirical selected case study, to be further processed by methods of linguistic analysis" (IGI Global InfoScipedia).
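To make the idea of a corpus concrete, here is a minimal Python sketch that treats a corpus as a small collection of named, tokenized documents and then measures how much vocabulary two of them share. The document names, the sample sentences, the tokenize helper, and the choice of Jaccard similarity (shared words divided by all distinct words) are assumptions for illustration only, not a method prescribed by this guide.

```python
import re

# Hypothetical documents; in practice these would be read from files,
# a database, or a web source.
documents = {
    "doc_a": "The committee approved the budget.",
    "doc_b": "The budget was approved by the council.",
}


def tokenize(text):
    """Split text into lowercase words (a crude tokenizer for illustration)."""
    return re.findall(r"[a-z']+", text.lower())


# The corpus: every document, tokenized and ready for further analysis.
corpus = {name: tokenize(text) for name, text in documents.items()}

# Compare the distinct vocabulary of the two documents.
vocab_a = set(corpus["doc_a"])
vocab_b = set(corpus["doc_b"])
shared = vocab_a & vocab_b

# Jaccard similarity: 0.0 means no words in common, 1.0 means identical vocabularies.
jaccard = len(shared) / len(vocab_a | vocab_b)
print(sorted(shared), round(jaccard, 2))
```

Keeping the corpus as a plain dictionary makes it easy to add documents later or to swap in a more careful tokenizer without changing the comparison step.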
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.