Text Mining

Accessing Social Media Data

The list below details the current availability of social media data for academic researchers. For all these sources, an API (Application Programming Interface) is the primary way to bulk download social media data, and typically requires coding expertise in Python and familiarity with JSON data formats. Some APIs require payment.
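
As a rough illustration of the JSON handling mentioned above, the sketch below flattens a nested API response into rows and writes them out as CSV. The sample response and its field names (`data`, `author`, `metrics`) are hypothetical; each platform uses its own structure, so treat this as a pattern rather than any particular API's format.

```python
import csv
import io
import json

# Hypothetical API response; real field names differ from platform to platform.
SAMPLE_RESPONSE = """
{
  "data": [
    {"id": "1", "text": "First post", "author": {"name": "user_a"}, "metrics": {"likes": 3}},
    {"id": "2", "text": "Second post", "author": {"name": "user_b"}, "metrics": {"likes": 7}}
  ]
}
"""


def flatten_posts(raw_json):
    """Turn nested JSON records into flat dictionaries, one per post."""
    payload = json.loads(raw_json)
    rows = []
    for post in payload.get("data", []):
        rows.append({
            "id": post.get("id"),
            "text": post.get("text"),
            # Nested objects are read with a default so missing keys do not raise.
            "author": post.get("author", {}).get("name"),
            "likes": post.get("metrics", {}).get("likes"),
        })
    return rows


def rows_to_csv(rows):
    """Serialize the flat rows as CSV text for a spreadsheet or text mining tool."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["id", "text", "author", "likes"])
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


rows = flatten_posts(SAMPLE_RESPONSE)
print(rows_to_csv(rows))
```

In practice the JSON string would come from an API response rather than a constant, and the list of columns would be adjusted to match the fields that source actually returns.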

X/Twitter

X/Twitter data is available to paying users through the developer API platform. The free academic research access program was discontinued in early 2023, so most projects now require the paid Basic or Pro tier. Limited exceptions for EU-focused research are available to qualified researchers under Article 40 of the Digital Services Act.

Attempts to use automated methods to scrape the site are explicitly prohibited by the Terms of Service. Manually compiling a smaller collection of publicly-available Tweets for analysis is permissible. 

Meta (Facebook/Instagram/Threads)

Free academic access to Meta data is mediated by the Meta Content Library and API in collaboration with the Inter-university Consortium for Political and Social Research (ICPSR) at the University of Michigan. Access to the Meta Content Library is by application only for qualified researchers. Depending on the data required, you can run analysis in their secure virtual environment or download data for local analysis. 

Attempts to use automated methods to scrape these sites are explicitly prohibited by the Terms of Service. Manually assembling a smaller corpus of publicly-available data for analysis is permissible (gathering data from private and/or personal accounts/groups is not permitted). 

YouTube

The YouTube Data API is free to use, with a default quota of 10,000 units per day (different request types cost different numbers of units). It can be used to extract data from YouTube channels, such as video metadata, captions, comments, live stream engagement, and more.
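
For readers who want a sense of what a request looks like, here is a minimal sketch that collects top-level comments for one video through the `commentThreads` endpoint of the YouTube Data API v3, using only the Python standard library. The helper names (`build_comment_url`, `parse_comments`, `fetch_comments`) and the selection of output fields are illustrative assumptions, and a valid API key from a Google Cloud project is assumed.

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://www.googleapis.com/youtube/v3/commentThreads"


def build_comment_url(video_id, api_key, page_token=None, max_results=100):
    """Build the request URL for one page of comment threads."""
    params = {
        "part": "snippet",
        "videoId": video_id,
        "maxResults": max_results,
        "textFormat": "plainText",
        "key": api_key,
    }
    if page_token:
        params["pageToken"] = page_token
    return API_URL + "?" + urllib.parse.urlencode(params)


def parse_comments(response):
    """Extract a few fields per top-level comment, plus the next page token."""
    rows = []
    for item in response.get("items", []):
        top = item["snippet"]["topLevelComment"]
        snippet = top["snippet"]
        rows.append({
            "comment_id": top.get("id"),
            "text": snippet.get("textDisplay"),
            "published_at": snippet.get("publishedAt"),
            "like_count": snippet.get("likeCount"),
        })
    return rows, response.get("nextPageToken")


def fetch_comments(video_id, api_key, max_pages=1):
    """Download up to max_pages pages of comments (this makes network calls)."""
    comments, token = [], None
    for _ in range(max_pages):
        url = build_comment_url(video_id, api_key, token)
        with urllib.request.urlopen(url) as reply:
            page, token = parse_comments(json.load(reply))
        comments.extend(page)
        if not token:
            break
    return comments
```

Calling `fetch_comments("VIDEO_ID", "YOUR_API_KEY", max_pages=3)` would request at most three pages; keeping `max_pages` small is one simple way to stay within the daily limit. Author names are deliberately left out of the output to reduce the amount of personal information collected.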

YouTube's Terms of Service prohibit the use of automated screen scrapers to extract data. 

TikTok

Public access to TikTok's API is limited in the United States. Non-profit, academic researchers can be granted access by application. This process is primarily for faculty researchers; however, PhD students may apply with a letter of endorsement from a professor.

TikTok's Terms of Service prohibit the use of automated screen scrapers to collect data. Researchers may manually collect a smaller corpus of publicly-available materials for analysis. 

Reddit

The Reddit Data API allows researchers to download the 2000 most recent posts from a subreddit. Queries can be built with PRAW, an easy-to-use Python wrapper. 
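
A minimal sketch of what a PRAW query might look like is shown below. It assumes you have installed the third-party `praw` package and registered an app with Reddit to obtain a client ID and secret. The function names and the choice of fields to keep are illustrative, not a recommended schema.

```python
from datetime import datetime, timezone


def make_reddit_client(client_id, client_secret, user_agent):
    """Create a read-only PRAW client; requires `pip install praw`."""
    import praw  # imported here so the rest of the sketch runs without it

    return praw.Reddit(
        client_id=client_id,
        client_secret=client_secret,
        user_agent=user_agent,
    )


def collect_recent_posts(reddit, subreddit_name, limit=100):
    """Return basic fields for the newest posts in one subreddit."""
    rows = []
    for submission in reddit.subreddit(subreddit_name).new(limit=limit):
        created = datetime.fromtimestamp(submission.created_utc, tz=timezone.utc)
        rows.append({
            "id": submission.id,
            "title": submission.title,
            "text": submission.selftext,
            "created_utc": created.isoformat(),
            "score": submission.score,
            "num_comments": submission.num_comments,
        })
    return rows
```

A typical call would be `collect_recent_posts(make_reddit_client(ID, SECRET, "my-research-script/0.1"), "AskHistorians", limit=200)`, where the credentials and the subreddit name are placeholders. PRAW handles paging and rate limiting on the client side, which is the main reason to prefer it over hand-written requests.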

Reddit's Terms of Service ban most uses of automated screen scrapers to collect data. You can submit a request to Reddit that explains your research and asks for additional access to the site.

The r/redditdev and r/reddit4researchers communities are good resources for understanding the current research landscape with Reddit data.

Tumblr

Registered users of the Tumblr API can download posts, tags, and other useful data from Tumblr. The Python API wrapper pytumblr can be used to simplify this process.
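
The sketch below shows one way pytumblr might be used to page through a blog's posts. It assumes the third-party `pytumblr` package is installed and that you have registered an application with Tumblr to obtain the four OAuth credentials. The function names, the page size of 20, and the fields kept per post are assumptions made for illustration.

```python
def make_tumblr_client(consumer_key, consumer_secret, oauth_token, oauth_secret):
    """Create a pytumblr client; requires `pip install pytumblr`."""
    import pytumblr  # imported here so the rest of the sketch runs without it

    return pytumblr.TumblrRestClient(
        consumer_key, consumer_secret, oauth_token, oauth_secret
    )


def collect_blog_posts(client, blog_name, total=40, page_size=20):
    """Fetch up to `total` posts from one blog, one page at a time."""
    rows = []
    for offset in range(0, total, page_size):
        response = client.posts(blog_name, limit=page_size, offset=offset)
        posts = response.get("posts", [])
        if not posts:
            break  # no more posts, or the API returned an error payload
        for post in posts:
            rows.append({
                "id": post.get("id"),
                "url": post.get("post_url"),
                "type": post.get("type"),
                "timestamp": post.get("timestamp"),
                "tags": post.get("tags", []),
                "summary": post.get("summary"),
            })
    return rows[:total]
```

For example, `collect_blog_posts(make_tumblr_client(KEY, SECRET, TOKEN, TOKEN_SECRET), "example-blog.tumblr.com")` would retrieve up to 40 posts from a placeholder blog. Post bodies vary by post type, so only fields that are common across types are kept here.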

Tumblr's Terms of Service prohibit the use of screen scrapers to collect data. 

Researchers must always do their due diligence, read and adhere to the Terms of Service for the data source, and comply with local laws around copyright, data collection, and privacy. Check out the Web Scraping a Corpus page of this guide for more information. 

Free Sources of Social Media Data

Important Considerations for Text Mining Social Media Data

Social media data can contain Personally Identifiable Information, or PII. Inappropriate use or sharing of PII can cause harm to the people it represents. Always use caution when text mining personal information and take actions to mitigate harm.
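
One common mitigation step is to pseudonymize author identifiers and redact obvious identifiers from the text before analysis or sharing. The sketch below is a starting point under simple assumptions: the regular expressions are illustrative and far from exhaustive, the function names are hypothetical, and redaction of this kind does not by itself guarantee anonymity.

```python
import hashlib
import re

# Illustrative patterns only; real data will contain identifiers these miss.
URL = re.compile(r"https?://\S+")
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
MENTION = re.compile(r"(?<!\w)@\w+")


def pseudonymize(author_id, salt):
    """Replace an author ID with a salted hash so posts can still be grouped."""
    digest = hashlib.sha256((salt + author_id).encode("utf-8")).hexdigest()
    return digest[:16]


def redact(text):
    """Replace URLs, email addresses, and @mentions with placeholder tokens."""
    text = URL.sub("[URL]", text)
    text = EMAIL.sub("[EMAIL]", text)
    text = MENTION.sub("[USER]", text)
    return text


example = "Email me at jane.doe@example.org or ping @jane_d, see https://example.org/me"
print(redact(example))
print(pseudonymize("jane_d", salt="keep-this-secret"))
```

The salt should be stored separately from the data and never published, since anyone who has it can recompute the hashes for known usernames.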

Additionally, social media data is often only accessible via a password authentication system, or only available for members of a certain group or online community. Information that requires authentication or approval is considered private, and thus should not be used as part of a text mining project.

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.