- Source name:
- Number of data revisions since 19 May 2020: 1
- Date of last change: 3 June 2020
- Available for: county, hrr, msa, state (see geography coding docs)
This data source is based on symptom surveys run by Carnegie Mellon. Facebook directs a random sample of its users to these surveys, which are voluntary. Individual survey responses are held by CMU and are sharable with other health researchers under a data use agreement. No individual survey responses are shared back to Facebook.
Of primary interest in these surveys are the symptoms defining a COVID-like illness (fever, along with cough, or shortness of breath, or difficulty breathing) or influenza-like illness (fever, along with cough or sore throat). Using this survey data, we estimate the percentage of people who have a COVID-like illness, or influenza-like illness, in a given location, on a given day.

| Signal | Description |
| --- | --- |
| `raw_cli` | Estimated percentage of people with COVID-like illness based on the criteria below, with no smoothing or survey weighting |
| `raw_ili` | Estimated percentage of people with influenza-like illness based on the criteria below, with no smoothing or survey weighting |
| `raw_wcli` | Estimated percentage of people with COVID-like illness; adjusted using survey weights as described below |
| `raw_wili` | Estimated percentage of people with influenza-like illness; adjusted using survey weights as described below |
| `raw_hh_cmnty_cli` | Estimated percentage of people reporting illness in their local community, as described below, including their household, with no smoothing or survey weighting |
| `raw_nohh_cmnty_cli` | Estimated percentage of people reporting illness in their local community, as described below, not including their household, with no smoothing or survey weighting |

Note that for `raw_hh_cmnty_cli` and `raw_nohh_cmnty_cli`, the illnesses included are broader: a respondent is included if they know someone in their household (for `raw_hh_cmnty_cli`) or community with fever, along with sore throat, cough, shortness of breath, or difficulty breathing. This does not attempt to distinguish between COVID-like and influenza-like illness.
Along with the `raw_` signals, there are additional signals with names beginning with `smoothed_`. These estimate the same quantities as the above signals, but are smoothed in time to reduce day-to-day sampling noise; see details below. Because the smoothed signals combine information across seven days, they have larger sample sizes and hence are available for more counties and MSAs than the raw signals.
- Survey Questions
- ILI and CLI Indicators
- Survey Weighting
The survey starts with the following 5 questions:
1. In the past 24 hours, have you or anyone in your household had any of the following (yes/no for each):
   - (a) Fever (100 °F or higher)
   - (b) Sore throat
   - (c) Cough
   - (d) Shortness of breath
   - (e) Difficulty breathing
2. How many people in your household (including yourself) are sick (fever, along with at least one other symptom from the above list)?
3. How many people are there in your household in total (including yourself)?
4. What is your current ZIP code?
5. How many additional people in your local community that you know personally are sick (fever, along with at least one other symptom from the above list)?
Beyond these 5 questions, the survey asks many further questions that go into more detail on symptoms and demographics. These are primarily of interest to researchers studying the social and economic effects of the pandemic, but could still be useful for forecasting purposes. The full survey can be found TODO. TODO Link to details on obtaining research access
As of mid-June 2020, the median number of Facebook survey responses per day is about 72,000.
Influenza-like illness (ILI) is a standard indicator, defined by the CDC as fever along with sore throat or cough. In terms of the symptoms listed in Q1 of our survey, this means a and (b or c).
COVID-like illness (CLI) is not a standard indicator. Through our discussions with the CDC, we chose to define it as fever along with cough, shortness of breath, or difficulty breathing. In terms of the symptoms listed in Q1, this means a and (c or d or e).
Symptoms alone are not sufficient to diagnose influenza or coronavirus infections, and so these ILI and CLI indicators are not expected to be unbiased estimates of the true rate of influenza or coronavirus infections. These symptoms can be caused by many other conditions, and many true infections can be asymptomatic. Instead, we expect these indicators to be useful for comparison across the United States and across time, to determine where symptoms appear to be increasing.
For a single survey, we are interested in the quantities:
- $X$ = the number of people in the household with ILI;
- $Y$ = the number of people in the household with CLI;
- $N$ = the number of people in the household.
Note that $N$ comes directly from the answer to Q3, but neither $X$ nor $Y$ can be computed directly, because Q2 does not record the precise symptom profile of each individual in the household; it only asks how many individuals have fever and at least one other symptom from the list.
We hence estimate $X$ and $Y$ with the following simple strategy. Consider ILI, without loss of generality (we apply the same strategy to CLI). Let $Z$ be the answer to Q2.
- If the answer to Q1 does not meet the ILI definition, then we report $X = 0$.
- If the answer to Q1 does meet the ILI definition, then we report $X = Z$.
This strategy can only "over count" (result in estimates that are too large for) the true $X$ and $Y$. For example, this happens when some members of the household experience ILI that does not also qualify as CLI, while others experience CLI that does not also qualify as ILI. In this case, for both $X$ and $Y$, our simple strategy would return the sum of both types of cases. However, given the extreme degree of overlap between the definitions of ILI and CLI, it is reasonable to believe that, if symptoms across all household members qualified as both ILI and CLI, each individual would have both, or neither, with neither being more common. Therefore we do not consider this "over counting" phenomenon practically problematic.
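
The per-survey strategy above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names and the dictionary keys for the Q1 symptoms are hypothetical and are not taken from the survey's actual data format.

```python
def meets_ili(fever, sore_throat, cough):
    """ILI as defined in this document: fever and (sore throat or cough)."""
    return fever and (sore_throat or cough)


def meets_cli(fever, cough, shortness_of_breath, difficulty_breathing):
    """CLI as defined in this document: fever and (cough or shortness of
    breath or difficulty breathing)."""
    return fever and (cough or shortness_of_breath or difficulty_breathing)


def household_counts(q1, q2_num_sick, q3_household_size):
    """Return (ILI count, CLI count, household size) for one survey response.

    q1 is a dict of booleans with hypothetical keys for symptoms (a)-(e).
    The Q2 answer is counted only when the Q1 answer meets the definition.
    """
    ili = meets_ili(q1["fever"], q1["sore_throat"], q1["cough"])
    cli = meets_cli(
        q1["fever"],
        q1["cough"],
        q1["shortness_of_breath"],
        q1["difficulty_breathing"],
    )
    ili_count = q2_num_sick if ili else 0
    cli_count = q2_num_sick if cli else 0
    return ili_count, cli_count, q3_household_size


# A household of 4 reporting fever and sore throat only, with 2 people sick:
response = {
    "fever": True,
    "sore_throat": True,
    "cough": False,
    "shortness_of_breath": False,
    "difficulty_breathing": False,
}
print(household_counts(response, 2, 4))  # (2, 0, 4)
```

Here the household meets the ILI definition but not the CLI definition, so both sick people are counted toward ILI and none toward CLI.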
Let $x$ and $y$ be the number of people with ILI and CLI, respectively, over a given time period, and in a given location (for example, the time period being a particular day, and a location being a particular county). Let $n$ be the total number of people in this location. We are interested in estimating the true ILI and CLI percentages, which we denote by $p$ and $q$, respectively:

$$
p = 100 \cdot \frac{x}{n}
\quad\text{and}\quad
q = 100 \cdot \frac{y}{n}.
$$

We estimate $p$ and $q$ across 4 temporal-spatial aggregation schemes:
- daily, at the county level;
- daily, at the MSA (metropolitan statistical area) level;
- daily, at the HRR (hospital referral region) level;
- daily, at the state level.
Note that these spatial aggregations are possible because we have the ZIP code of the household from Q4 of the survey. Our current rule of thumb is to discard any estimate (whether at a county, MSA, HRR, or state level) that is based on fewer than 100 survey responses. When our geographical mapping data indicates that a ZIP code is part of multiple geographical units in a single aggregation, we assign weights to each of these units and proceed as described below, but with uniform participation weights ($w^{\text{part}}_i = 1$ for all $i$).
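In a given temporal-spatial unit (for example, daily-county), let $X_i$ and $Y_i$ denote the number of ILI and CLI cases in the household, respectively (computed according to the simple strategy described above), and let $N_i$ denote the total number of people in the household, in survey $i$, out of $m$ surveys we collected. Then our estimates of the ILI and CLI percentages $p$ and $q$ (see Appendix below for motivating details) are:

$$
\hat{p} = 100 \cdot \frac{1}{m} \sum_{i=1}^m \frac{X_i}{N_i}
\quad\text{and}\quad
\hat{q} = 100 \cdot \frac{1}{m} \sum_{i=1}^m \frac{Y_i}{N_i}.
$$

Their estimated standard errors are:

$$
\begin{aligned}
\widehat{\mathrm{se}}(\hat{p}) &= 100 \cdot \frac{1}{m+1} \sqrt{
  \left(\frac{1}{2} - \frac{\hat{p}}{100}\right)^2 +
  \sum_{i=1}^m \left(\frac{X_i}{N_i} - \frac{\hat{p}}{100}\right)^2
}, \\
\widehat{\mathrm{se}}(\hat{q}) &= 100 \cdot \frac{1}{m+1} \sqrt{
  \left(\frac{1}{2} - \frac{\hat{q}}{100}\right)^2 +
  \sum_{i=1}^m \left(\frac{Y_i}{N_i} - \frac{\hat{q}}{100}\right)^2
},
\end{aligned}
$$

the standard deviations of the estimators after adding a single pseudo-observation at 1/2 (treating $m$ as fixed). The use of the pseudo-observation prevents standard error estimates of zero, and in simulations improves the quality of the standard error estimates.
The pseudo-observation is not used in $\hat{p}$ and $\hat{q}$ themselves, to avoid potentially large amounts of estimation bias, as $p$ and $q$ are expected to be small.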
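
As a rough sketch of the unweighted estimator, the following Python function averages the per-household fractions and computes a standard error with one pseudo-observation at 1/2. The function name and the `min_responses` parameter are hypothetical, and the exact form of the standard error is an assumption: it is one reading of "standard deviation after adding a single pseudo-observation at 1/2", not a confirmed transcription of the production code.

```python
import math


def percent_estimate(case_counts, household_sizes, min_responses=100):
    """Estimate a percentage and its standard error from household data.

    case_counts[i] is the ILI (or CLI) count for survey i and
    household_sizes[i] is the household size reported in Q3.
    Returns None when there are fewer than min_responses surveys.
    """
    m = len(case_counts)
    if m < min_responses:
        return None  # rule of thumb: discard estimates with too few responses

    fractions = [x / n for x, n in zip(case_counts, household_sizes)]
    p_hat = sum(fractions) / m  # the pseudo-observation is NOT used here

    # Assumed form: squared deviations of the m observations plus one
    # pseudo-observation at 1/2, divided by (m + 1).
    sum_sq = (0.5 - p_hat) ** 2 + sum((f - p_hat) ** 2 for f in fractions)
    se = math.sqrt(sum_sq) / (m + 1)

    return 100 * p_hat, 100 * se


# 100 households of size 4; ten of them report one sick person each.
counts = [1] * 10 + [0] * 90
sizes = [4] * 100
estimate, std_err = percent_estimate(counts, sizes)
```

With this input the point estimate is 2.5 percent. Because of the pseudo-observation, the standard error stays strictly positive even if every household reports zero cases.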
Over a given time period, and in a given location, let $u$ be the number of people who know someone in their community with CLI, and let $v$ be the number of people who know someone in their community, outside of their household, with CLI. With $n$ denoting the total number of people in this location, we are interested in the percentages:

$$
a = 100 \cdot \frac{u}{n}
\quad\text{and}\quad
b = 100 \cdot \frac{v}{n}.
$$

We will estimate $a$ and $b$ across the same 4 temporal-spatial aggregation schemes as before.
For a single survey, let:
- $U = 1$ if and only if a positive number is reported for Q2 or Q5;
- $V = 1$ if and only if a positive number is reported for Q5.
In a given temporal-spatial unit (for example, daily-county), let $U_i$ and $V_i$ denote these quantities for survey $i$, and let $m$ denote the total number of surveys. Then to estimate $a$ and $b$, we simply use:

$$
\hat{a} = 100 \cdot \frac{1}{m} \sum_{i=1}^m U_i
\quad\text{and}\quad
\hat{b} = 100 \cdot \frac{1}{m} \sum_{i=1}^m V_i.
$$

Hence $\hat{a}$ is reported in the `hh_cmnty_cli` signals and $\hat{b}$ in the `nohh_cmnty_cli` signals. Their estimated standard errors are:

$$
\widehat{\mathrm{se}}(\hat{a}) = 100 \cdot \sqrt{\frac{\frac{\hat{a}}{100}\left(1 - \frac{\hat{a}}{100}\right)}{m}}
\quad\text{and}\quad
\widehat{\mathrm{se}}(\hat{b}) = 100 \cdot \sqrt{\frac{\frac{\hat{b}}{100}\left(1 - \frac{\hat{b}}{100}\right)}{m}},
$$

which are the plug-in estimates of the standard errors of the binomial proportions (treating $m$ as fixed).
Note that $\sum_{i=1}^m U_i$ is the number of survey respondents who know someone in their community with either ILI or CLI, and not CLI alone; and similarly for $V$. Hence $\hat{a}$ and $\hat{b}$ will generally overestimate $a$ and $b$. However, given the extremely high overlap between the definitions of ILI and CLI, we do not consider this to be practically very problematic.
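
The community estimates and their binomial plug-in standard errors can be sketched as follows. The function name is hypothetical. The sketch also assumes that the indicator excluding the household is driven by Q5 (additional people in the local community), which is the reading consistent with the signal descriptions; treat that as an assumption rather than a confirmed detail of the pipeline.

```python
import math


def community_estimates(q2_answers, q5_answers):
    """Return ((a_hat, se_a), (b_hat, se_b)) in percent.

    q2_answers[i]: number of sick people in the household (Q2).
    q5_answers[i]: number of additional sick people known in the community (Q5).
    """
    m = len(q2_answers)
    # Indicator including the household: positive answer to Q2 or Q5.
    u = [1 if (q2 > 0 or q5 > 0) else 0 for q2, q5 in zip(q2_answers, q5_answers)]
    # Indicator excluding the household: positive answer to Q5 (assumed).
    v = [1 if q5 > 0 else 0 for q5 in q5_answers]

    def proportion_and_se(indicators):
        prop = sum(indicators) / m
        # Plug-in binomial standard error, treating m as fixed.
        se = math.sqrt(prop * (1 - prop) / m)
        return 100 * prop, 100 * se

    return proportion_and_se(u), proportion_and_se(v)


# Four respondents: one with household illness, one with community illness.
(a_hat, se_a), (b_hat, se_b) = community_estimates([1, 0, 0, 0], [0, 2, 0, 0])
print(a_hat, b_hat)  # 50.0 25.0
```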
The smoothed versions of the signals described above (with the `smoothed_` prefix) are calculated using seven-day pooling. For example, the estimate reported for June 7 in a specific geographical area (such as county or MSA) is formed by collecting all surveys completed between June 1 and 7 (inclusive) and using that data in the estimation procedures described above.
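
A minimal sketch of the seven-day pooling window is shown below. The survey records are represented as dictionaries with a hypothetical `"day"` key; the real data layout is not described in this document.

```python
from datetime import date, timedelta


def pooled_surveys(surveys, target_day, window_days=7):
    """Select surveys completed in the window ending on target_day (inclusive)."""
    start_day = target_day - timedelta(days=window_days - 1)
    return [s for s in surveys if start_day <= s["day"] <= target_day]


surveys = [
    {"id": 1, "day": date(2020, 5, 31)},  # one day before the window
    {"id": 2, "day": date(2020, 6, 1)},   # first day of the window
    {"id": 3, "day": date(2020, 6, 7)},   # the target day itself
    {"id": 4, "day": date(2020, 6, 8)},   # after the target day
]
pooled = pooled_surveys(surveys, date(2020, 6, 7))
print([s["id"] for s in pooled])  # [2, 3]
```

The pooled list would then be passed to the same estimation procedure used for the raw signals.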
Notice that the estimates defined in the last two subsections actually reflect the percentage of individuals with ILI and CLI, and individuals who know someone with CLI, with respect to the population of US Facebook users. (To be precise, the estimates above actually reflect the percentage of individuals with ILI and CLI with respect to the population of US Facebook users and their household members.) In reality, our estimates are skewed even further by the varying propensity of people in the population of US Facebook users to take our survey in the first place.
When Facebook sends a user to our survey, it generates a random ID number and sends this to us as well. Once the user completes the survey, we pass this ID number back to Facebook to confirm completion, and in return receive a weight, which we call $w^{\text{part}}_i$ for user $i$. (To be clear, the random ID number that is generated is completely meaningless for any other purpose than receiving said weight, and does not allow us to access any information about the user's Facebook profile.)
We can use these weights to adjust our estimates of the true ILI and CLI proportions so that they are representative of the US population—adjusting both for the differences between the US population and US Facebook users (according to a state-by-age-gender stratification of the US population from the 2018 Census March Supplement) and for the propensity of a Facebook user to take our survey in the first place.
In more detail, we receive a participation weight

$$
w^{\text{part}}_i \propto \frac{1}{\pi_i},
$$

where $\pi_i$ is an estimated probability (produced by Facebook) that an individual with the same state-by-age-gender profile as user $i$ would be a Facebook user and take our CMU survey. The weight is determined only up to an unknown positive scaling constant. The adjustment we make follows a standard inverse probability weighting strategy (this being a special case of importance sampling).
As before, for a given temporal-spatial unit (for example, daily-county), let $X_i$ and $Y_i$ denote the numbers of ILI and CLI cases in household $i$, respectively (computed according to the simple strategy above), and let $N_i$ denote the total number of people in the household. Let $i = 1, \dots, m$ index the surveys started during the time period of interest and reported in a ZIP code intersecting the spatial unit of interest.
Each of these surveys is assigned two weights: the participation weight $w^{\text{part}}_i$, and a geographical-division weight $w^{\text{geodiv}}_i$ describing how much a participant's ZIP code "belongs" in the spatial unit of interest. (For example, a ZIP code may overlap with multiple counties, so the weight describes what proportion of the ZIP code's population is in each county.)
Let $w^{\text{init}}_i \propto w^{\text{part}}_i w^{\text{geodiv}}_i$ denote the initial weight assigned to survey $i$. This is the weight provided to us by Facebook, multiplied by the geographical-division weight and rescaled so that $\sum_{i=1}^m w^{\text{init}}_i = 1$.
First, the initial weights are adjusted to reduce sensitivity to any individual survey by "mixing" them with a uniform weighting across all relevant surveys. This prevents specific survey respondents with high survey weights from having disproportionate influence on the weighted estimates.
Specifically, we select the smallest value of the mixing coefficient $\alpha$ such that

$$
w_i = \alpha \cdot \frac{1}{m} + (1 - \alpha) \cdot w^{\text{init}}_i \leq 0.01
$$

for all $i$. If such a selection is impossible, then we have insufficient survey responses (fewer than 100, in which case even the uniform weight $1/m$ exceeds 0.01), and do not produce an estimate for the given temporal-spatial unit. Because both the initial weights and the uniform weights sum to one, so do the mixed weights $w_i$.
Then our adjusted estimates of $p$ and $q$ are:

$$
\hat{p}_w = 100 \cdot \sum_{i=1}^m w_i \frac{X_i}{N_i}
\quad\text{and}\quad
\hat{q}_w = 100 \cdot \sum_{i=1}^m w_i \frac{Y_i}{N_i},
$$

with estimated standard errors:
which are the delta method estimates of variance associated with self-normalized importance sampling estimators above, after combining with a pseudo-observation of 1/2 with weight assigned to appear like a single effective observation according to importance sampling diagnostics.
The sample size reported is calculated by rounding down $\sum_{i=1}^m w^{\text{geodiv}}_i$ before adding the pseudo-observations. When ZIP codes do not overlap multiple spatial units of interest, these weights are all one, and this expression simplifies to $m$. When estimates are available for all spatial units of a given type over some time period, the sum of the associated sample sizes under this definition is consistent with the number of surveys used to prepare the estimate. (This notion of sample size is distinct from the "effective" sample sizes, based on the variance of the importance sampling estimators, which were used above.)
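
The weight mixing, the weighted point estimate, and the reported sample size can be sketched as below. All function names are hypothetical. The sketch assumes a cap of 0.01 on each mixed weight (consistent with the 100-response minimum) and places no lower bound on the mixing coefficient; the production pipeline may differ on both points. The standard error is deliberately omitted.

```python
import math


def mix_weights(init_weights, cap=0.01):
    """Mix normalized weights with uniform weights so that none exceeds cap.

    Returns the mixed weights (summing to one), or None when even uniform
    weights would exceed the cap, i.e. when there are too few surveys.
    """
    m = len(init_weights)
    total = sum(init_weights)
    normalized = [w / total for w in init_weights]
    uniform = 1.0 / m
    if uniform > cap:
        return None  # fewer than 1/cap surveys: no estimate is produced

    largest = max(normalized)
    if largest <= cap:
        alpha = 0.0  # no mixing needed
    else:
        # Smallest alpha with alpha * uniform + (1 - alpha) * largest <= cap.
        alpha = (largest - cap) / (largest - uniform)
    return [alpha * uniform + (1 - alpha) * w for w in normalized]


def weighted_percent(weights, case_counts, household_sizes):
    """Weighted percentage; weights are assumed to sum to one."""
    return 100 * sum(
        w * x / n for w, x, n in zip(weights, case_counts, household_sizes)
    )


def reported_sample_size(geodiv_weights):
    """Sum of geographical-division weights, rounded down."""
    return math.floor(sum(geodiv_weights))


# 200 surveys, one of which carries ten times the weight of the others.
mixed = mix_weights([10.0] + [1.0] * 199)
```

In this example the heavily weighted survey is pulled down to the cap, while the remaining weights rise slightly so that the total stays at one.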
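As before, in a given temporal-spatial unit (for example, daily-county), let $U_i$ and $V_i$ denote the indicators that the survey respondent knows someone in their community with CLI, including and not including their household, respectively, for survey $i$, out of $m$ surveys collected. Also let $w_i$ be the self-normalized weight that accompanies survey $i$, as above. Then our adjusted estimates of $a$ and $b$ are:

$$
\hat{a}_w = 100 \cdot \sum_{i=1}^m w_i U_i
\quad\text{and}\quad
\hat{b}_w = 100 \cdot \sum_{i=1}^m w_i V_i,
$$

with estimated standard errors:

$$
\widehat{\mathrm{se}}(\hat{a}_w) = 100 \cdot \sqrt{\sum_{i=1}^m w_i^2 \left(U_i - \frac{\hat{a}_w}{100}\right)^2}
\quad\text{and}\quad
\widehat{\mathrm{se}}(\hat{b}_w) = 100 \cdot \sqrt{\sum_{i=1}^m w_i^2 \left(V_i - \frac{\hat{b}_w}{100}\right)^2},
$$

the delta method estimates of variance associated with self-normalized importance sampling estimators.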
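
For illustration, a weighted indicator estimate with a delta-method standard error might look like the sketch below. The function name is hypothetical, and the variance expression is the textbook delta-method approximation for a self-normalized importance sampling estimator; it is assumed, not confirmed, to match what the pipeline computes.

```python
import math


def weighted_indicator_percent(weights, indicators):
    """Return (estimate, standard error) in percent for 0/1 indicators."""
    total = sum(weights)
    w = [wi / total for wi in weights]  # self-normalize the weights
    est = sum(wi * ui for wi, ui in zip(w, indicators))
    # Delta-method variance of the self-normalized estimator.
    var = sum(wi ** 2 * (ui - est) ** 2 for wi, ui in zip(w, indicators))
    return 100 * est, 100 * math.sqrt(var)


est, se = weighted_indicator_percent([1, 1, 1, 1], [1, 1, 0, 0])
print(est, se)  # 50.0 25.0
```

With uniform weights this reduces to the plug-in binomial standard error, which gives a quick sanity check on any implementation.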