Twitter Stream
| Attribute | Details |
|---|---|
| Source Name | twitter |
| Data Source | HealthTweets |
| Geographic Levels | National, Department of Health & Human Services (HHS) Regions, Census Divisions, State (see Geographic Codes) |
| Temporal Granularity | Daily and Weekly (Epiweek) |
| Reporting Cadence | Inactive - No longer updated since 2020w31 (2020-12-07) |
| Temporal Scope Start | 2011w48 (2011-11-27) |
Overview
This data source provides estimates of influenza activity derived from the content of public Twitter posts. The data was processed by HealthTweets.org using natural language processing (NLP) to classify tweets as flu-related.
General topics not specific to any particular endpoint are discussed in the API overview. Such topics include: contributing, citing, and data licensing.
Note: Access is restricted. This endpoint requires authentication.
Estimation
The classification and processing pipeline involves several stages to transform raw Twitter data into health trends:
- Data Collection: Two streams are collected via the Twitter Streaming API:
  - HEALTH Stream: Capped at 1% of public tweets, filtered using 269 health-related keywords.
  - SAMPLE Stream: A random 1% sample of all public tweets.
- Classification: A statistical classifier identifies health-related tweets within the HEALTH stream (estimated F1-score of 0.70). These are further processed to distinguish actual influenza reports from general awareness or concern.
- Geolocation: Every identified health tweet and every tweet from the SAMPLE stream is geolocated using Carmen, which resolves location down to the city level using profile data and geotags.
- Normalization: The volume of identified influenza infections (`num`) is normalized by the total volume of tweets from the same location in the SAMPLE stream (`total`) to calculate the prevalence (`percent`).
- Gap Filling: Missing data (e.g., due to network interruptions) is estimated based on adjacent days.
For more technical details, see the research paper below.
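The normalization and gap-filling steps can be sketched as follows. This is a minimal illustration, not the HealthTweets implementation: `percent_flu` and `fill_gaps` are hypothetical names, the factor of 100 is inferred from the example response on this page (`num` 3067, `total` 443291, `percent` 0.6919), and averaging the two neighboring days is only one plausible reading of "estimated based on adjacent days".

```python
from typing import List, Optional


def percent_flu(num: int, total: int) -> Optional[float]:
    """Flu-related tweets as a percentage of SAMPLE-stream tweets."""
    if total <= 0:
        return None  # no denominator, so prevalence is undefined
    return 100.0 * num / total


def fill_gaps(daily: List[Optional[float]]) -> List[Optional[float]]:
    """Fill an isolated missing day with the mean of its two neighbors.

    Assumption: the source only says gaps are estimated from adjacent
    days; this averaging rule is a stand-in for illustration.
    """
    filled = list(daily)
    for i in range(1, len(daily) - 1):
        prev_day, next_day = daily[i - 1], daily[i + 1]
        if daily[i] is None and prev_day is not None and next_day is not None:
            filled[i] = (prev_day + next_day) / 2
    return filled


# Values taken from the national 2015w01 example response.
print(round(percent_flu(3067, 443291), 4))
# A single missing day between two observed days.
print(fill_gaps([0.5, None, 1.0]))
```

Gaps at the start or end of a series, or runs of several missing days, are left as `None` here because the source does not say how they are handled.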
Limitations
- Highly dependent on Twitter’s API access and terms of service, which have changed significantly.
- Twitter users are not a representative sample of the general population.
The API
The base URL is: https://api.delphi.cmu.edu/epidata/twitter/
Parameters
Required
| Parameter | Description | Type |
|---|---|---|
| `auth` | password | string |
| `locations` | locations | list of location codes: `nat`, HHS regions, Census divisions, or state codes (see Geographic Codes) |
| `dates` | dates (see Date Formats) | list of dates |
| `epiweeks` | epiweeks (see Date Formats) | list of epiweeks |
Note: Only one of `dates` and `epiweeks` is required. If both are provided, `epiweeks` is ignored.
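A request URL can be assembled from these parameters with the Python standard library. The sketch below assumes that list values are sent as comma-separated strings and that `auth_token` stands in for a real key; `build_twitter_url` is a hypothetical helper, not part of any client library.

```python
from typing import Optional, Sequence
from urllib.parse import urlencode

BASE_URL = "https://api.delphi.cmu.edu/epidata/twitter/"


def build_twitter_url(
    auth: str,
    locations: Sequence[str],
    dates: Optional[Sequence[int]] = None,
    epiweeks: Optional[Sequence[int]] = None,
) -> str:
    """Build a query URL that carries exactly one time parameter."""
    if not dates and not epiweeks:
        raise ValueError("provide dates or epiweeks")
    params = {"auth": auth, "locations": ",".join(locations)}
    if dates:
        # epiweeks is ignored when dates is present, so only dates is sent.
        params["dates"] = ",".join(str(d) for d in dates)
    else:
        params["epiweeks"] = ",".join(str(w) for w in epiweeks)
    # safe="," keeps list separators readable instead of encoding them as %2C.
    return f"{BASE_URL}?{urlencode(params, safe=',')}"


print(build_twitter_url("auth_token", ["nat"], epiweeks=[201501]))
```

Dropping `epiweeks` on the client side when `dates` is given mirrors the rule in the note, so the URL never carries a parameter the endpoint would discard.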
Response
| Field | Description | Type |
|---|---|---|
| `result` | result code: 1 = success, 2 = too many results, -2 = no results | integer |
| `epidata` | list of results | array of objects |
| `epidata[].location` | location label | string |
| `epidata[].date` | date (yyyy-MM-dd) | string |
| `epidata[].epiweek` | epiweek | integer |
| `epidata[].num` | number of flu-related tweets in the HEALTH stream (see Estimation) | integer |
| `epidata[].total` | total number of tweets in the random SAMPLE stream for the same location | integer |
| `epidata[].percent` | flu-related tweets normalized by the number of tweets in SAMPLE | float |
| `message` | success or error message | string |
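When handling a response, it can help to branch on `result` before reading `epidata`. The sketch below works on an already-parsed response dictionary and uses the result codes listed in the table; `extract_rows` is a hypothetical helper, and raising an error for every code other than 1 and -2 is a choice made for this example rather than documented client behavior.

```python
from typing import Any, Dict, List


def extract_rows(response: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Return the epidata rows, or raise if the call did not succeed."""
    result = response.get("result")
    if result == 1:  # success
        return response.get("epidata") or []
    if result == -2:  # no results
        return []
    # Any other code (for example 2, too many results) is surfaced to the caller.
    raise RuntimeError(f"result={result}: {response.get('message')}")


# Response body from the national 2015w01 query documented on this page.
example = {
    "result": 1,
    "epidata": [
        {
            "location": "nat",
            "num": 3067,
            "total": 443291,
            "epiweek": 201501,
            "percent": 0.6919,
        }
    ],
    "message": "success",
}

for row in extract_rows(example):
    print(row["location"], row["epiweek"], row["num"], row["total"], row["percent"])
```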
Example URLs
Twitter on 2015w01 (national)
https://api.delphi.cmu.edu/epidata/twitter/?auth=...&locations=nat&epiweeks=201501
{
  "result": 1,
  "epidata": [
    {
      "location": "nat",
      "num": 3067,
      "total": 443291,
      "epiweek": 201501,
      "percent": 0.6919
    }
  ],
  "message": "success"
}
Citing the Data
Researchers who use the Twitter Stream data are asked to credit and cite the source in publications based on the data. Specifically, we ask that you cite the paper describing the HealthTweets.org platform:
Mark Dredze, Renyuan Cheng, Michael J Paul, David A Broniatowski. HealthTweets.org: A Platform for Public Health Surveillance using Twitter. AAAI Workshop on the World Wide Web and Public Health Intelligence, 2014.
Code Samples
Libraries are available for R and Python.
The following samples show how to import the library and fetch Twitter data at the national level for epiweek 201501.
For Python, install the package using pip:
pip install -e "git+https://github.com/cmu-delphi/epidatpy.git#egg=epidatpy"
# Import
from epidatpy import EpiDataContext
# Fetch data
epidata = EpiDataContext()
res = epidata.pvt_twitter(auth='auth_token', locations=['nat'], time_type="week", time_values=[201501])
print(res)
For R, load the package:
library(epidatr)
# Fetch data
res <- pvt_twitter(auth = 'auth_token', locations = 'nat',
time_type = "week", time_values = 201501)
print(res)
Legacy Clients
We recommend using the modern client libraries mentioned above. Legacy clients are also available for Python, R, and JavaScript.
Optionally install the package using pip(env):
pip install delphi-epidata
Otherwise, place delphi_epidata.py from this repo next to your Python script.
# Import
from delphi_epidata import Epidata
# Fetch data
res = Epidata.twitter('auth_token', ['nat'], time_type="week", time_values=[201501])
print(res['result'], res['message'], len(res['epidata']))
Place delphi_epidata.R from this repo next to your R script.
source("delphi_epidata.R")
# Fetch data
res <- Epidata$twitter(auth = "auth_token", locations = list("nat"), time_type = "week", time_values = list(201501))
print(res$message)
print(length(res$epidata))
<!-- Imports -->
<script src="delphi_epidata.js"></script>
<!-- Fetch data -->
<script>
EpidataAsync.twitter('auth_token', ['nat'], EpidataAsync.range(201501, 201510)).then((res) => {
console.log(res.result, res.message, res.epidata != null ? res.epidata.length : 0);
});
</script>