Getting Started

Overview

This package provides access to data from the COVIDcast API, which offers numerous COVID-related data streams, updated daily. The data is retrieved live from the server when you make a request and is not stored within the package itself, so each request returns the latest data available. If you are conducting an analysis or powering a service that requires repeated access to the same signal, please download the data once rather than making repeated requests.
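One simple way to follow this advice is to cache the first download to disk and reload it on later runs. The sketch below is our own pattern, not part of the package; the file name and helper are hypothetical:

```python
import os
from datetime import date

import pandas as pd

CACHE_FILE = "fb_survey_smoothed_cli.csv"  # hypothetical local cache path

def load_cli_data():
    """Download the signal once, then reuse the local copy on later runs."""
    if os.path.exists(CACHE_FILE):
        # Reuse the cached copy instead of querying the API again.
        return pd.read_csv(CACHE_FILE, parse_dates=["time_value", "issue"])
    import covidcast
    data = covidcast.signal("fb-survey", "smoothed_cli",
                            date(2020, 5, 1), date(2020, 5, 7),
                            "county")
    data.to_csv(CACHE_FILE, index=False)
    return data
```

Any on-disk format works here; CSV simply keeps the example short.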

Installation

This package is available on PyPI as covidcast, and can be installed using pip or your favorite Python package manager:

pip install covidcast

This will install the package as well as all required dependencies.

Signal Overview

The API documentation lists the available signals, including many not shown on the COVIDcast interactive map.

The data come from a variety of sources and cover information including official case counts, internet search trends, hospital encounters, survey responses, and more. Below is a brief overview of each source, with links to their full descriptions.

To specify a signal, you will need the “Source Name” and “Signal” values, which are listed for each source/signal combination on its respective page. For example, to obtain the raw Google search volume for COVID-related topics, the source would be ght and the signal would be raw_search, as shown on the Google Health Trends page. These values are passed as arguments to the covidcast.signal() function to retrieve the desired data.

Basic examples

To obtain smoothed estimates of COVID-like illness from our symptom survey, distributed through Facebook, for every county in the United States between 2020-05-01 and 2020-05-07:

>>> from datetime import date
>>> import covidcast
>>> data = covidcast.signal("fb-survey", "smoothed_cli",
...                         date(2020, 5, 1), date(2020, 5, 7),
...                         "county")
>>> data.head()
   geo_value      issue  lag  sample_size    stderr time_value     value
0      01000 2020-05-23   22    1722.4551  0.125573 2020-05-01  0.823080
1      01001 2020-05-23   22     115.8025  0.800444 2020-05-01  1.261261
2      01003 2020-05-23   22     584.3194  0.308680 2020-05-01  0.665129
3      01015 2020-05-23   22     122.5577  0.526590 2020-05-01  0.574713
4      01031 2020-05-23   22     114.8318  0.347450 2020-05-01  0.408163

Each row represents one observation in one county on one day. The county FIPS code is given in the geo_value column, and the date in the time_value column. Here value is the requested signal—in this case, the smoothed estimate of the percentage of people with COVID-like illness, based on the symptom surveys. stderr is its standard error. The issue column indicates when this data was reported; in this case, the survey estimates for May 1st were updated on May 23rd based on new data, giving a lag of 22 days. See the covidcast.signal() documentation for details on the returned data frame.

Note

By default, this package submits queries to the API anonymously. If you have an API key, you can use it with this package by calling covidcast.use_api_key() and then calling the fetch functions as normal:

>>> covidcast.use_api_key("your_api_key")
>>> data = covidcast.signal("fb-survey", "smoothed_cli",
...                         date(2020, 5, 1), date(2020, 5, 7),
...                         "county")

The API documentation lists each available signal and provides technical details on how it is estimated and how its standard error is calculated. In this case, for example, the symptom surveys documentation page explains the definition of “COVID-like illness”, links to the exact survey text, and describes the mathematical derivation of the estimates.

We can also request all data on a signal after a specific date. Here, for example, we obtain smoothed_cli in each state for every day since 2020-05-01:

>>> data = covidcast.signal("fb-survey", "smoothed_cli",
...                         date(2020, 5, 1), geo_type="state")
>>> data.head()
   geo_value      issue  lag  sample_size    stderr time_value     value
0         ak 2020-05-23   22    1606.0000  0.158880 2020-05-01  0.460772
1         al 2020-05-23   22    7540.2437  0.082553 2020-05-01  0.699511
2         ar 2020-05-23   22    4921.4827  0.103651 2020-05-01  0.759798
3         az 2020-05-23   22   11220.9587  0.061794 2020-05-01  0.566937
4         ca 2020-05-23   22   51870.1382  0.022803 2020-05-01  0.364908

Using the geo_values argument, we can request data for a specific geography, such as the state of Pennsylvania for the month of May 2020:

>>> pa_data = covidcast.signal("fb-survey", "smoothed_cli",
...                            date(2020, 5, 1), date(2020, 5, 31),
...                            geo_type="state", geo_values="pa")
>>> pa_data.head()
   geo_value      issue  lag  sample_size    stderr time_value     value
0         pa 2020-05-23   22   31576.0165  0.030764 2020-05-01  0.400011
0         pa 2020-05-23   21   31344.0168  0.030708 2020-05-02  0.394774
0         pa 2020-05-23   20   30620.0162  0.031173 2020-05-03  0.396340
0         pa 2020-05-23   19   30419.0163  0.029836 2020-05-04  0.357501
0         pa 2020-05-23   18   29245.0172  0.030176 2020-05-05  0.354521

We can request multiple states by providing a list, such as ["pa", "ny", "mo"].

Sometimes it may be useful to join multiple signals into a single data frame. For example, suppose we’d like to look at the relationship between cases at each location and the number of deaths three days later. The covidcast.aggregate_signals() function can combine multiple data frames into a single one, optionally with lag. In this case, we use it as follows:

>>> cases = covidcast.signal("indicator-combination", "confirmed_incidence_num",
...                          date(2020, 5, 1), date(2020, 5, 31),
...                          geo_type="state", geo_values="pa")
>>> deaths = covidcast.signal("indicator-combination", "deaths_incidence_num",
...                           date(2020, 5, 1), date(2020, 5, 31),
...                           geo_type="state", geo_values="pa")
>>> cases_v_deaths = covidcast.aggregate_signals([cases, deaths], dt=[3, 0])
>>> cases_v_deaths = cases_v_deaths.rename(
...     columns={"indicator-combination_confirmed_incidence_num_0_value": "cases",
...              "indicator-combination_deaths_incidence_num_1_value": "deaths"})
>>> cases_v_deaths[["time_value", "geo_value", "cases", "deaths"]].head()
  time_value geo_value   cases  deaths
0 2020-05-01        pa     NaN    62.0
1 2020-05-02        pa     NaN    65.0
2 2020-05-03        pa     NaN    24.0
3 2020-05-04        pa  1209.0    13.0
4 2020-05-05        pa  1332.0   547.0

The resulting cases_v_deaths data frame contains one row per location per day. The death value is the number of deaths on that day; the cases value is the number of cases 3 days prior, matching the dt provided to covidcast.aggregate_signals(). The first three case values shown above are NaN because the input data frame did not contain case numbers for late April.

Note the long column names used by default to prevent ambiguity or name collisions.
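The lagged join that covidcast.aggregate_signals() performs can be reproduced with plain pandas, which may help clarify what dt does. The sketch below uses a tiny synthetic frame with made-up counts, not real API output:

```python
import pandas as pd

# Hypothetical daily case and death counts for one state.
days = pd.date_range("2020-05-01", "2020-05-05")
cases = pd.DataFrame({"time_value": days, "cases": [100, 120, 90, 110, 130]})
deaths = pd.DataFrame({"time_value": days, "deaths": [5, 7, 3, 4, 6]})

# Shifting the case dates forward by 3 days aligns each day's deaths
# with the case count from 3 days earlier, mirroring dt=[3, 0].
cases_shifted = cases.assign(time_value=cases["time_value"] + pd.Timedelta(days=3))
joined = deaths.merge(cases_shifted, on="time_value", how="left")
```

As in the output above, the first three rows of the joined frame have NaN cases, because no case counts exist for the three days before the window starts.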

Tracking issues and updates

The COVIDcast API records not just each signal’s estimate for a given location on a given day, but also when that estimate was made, and all updates to that estimate.

For example, consider using our doctor visits signal, which estimates the percentage of outpatient doctor visits that are COVID-related, and consider a result row with time_value 2020-05-01 for geo_values = "pa". This is an estimate for the percentage in Pennsylvania on May 1, 2020. That estimate was issued on May 5, 2020; the delay reflects the time our source takes to aggregate its data and the time the COVIDcast API takes to ingest it. Later, the estimate for May 1st could be updated, perhaps because additional visit data from May 1st arrived at our source and was reported to us. This constitutes a new issue of the data.

By default, covidcast.signal() fetches the most recent issue available. This is the best option for users who simply want to graph the latest data or construct dashboards. But if we are interested in knowing when data was reported, we can request specific data versions.

First, we can request the data that was available as of a specific date, using the as_of argument:

>>> covidcast.signal("doctor-visits", "smoothed_cli",
...                  start_day=date(2020, 5, 1), end_day=date(2020, 5, 1),
...                  geo_type="state", geo_values="pa",
...                  as_of=date(2020, 5, 7))
   geo_value      issue  lag sample_size stderr time_value    value
0         pa 2020-05-07    6        None   None 2020-05-01  2.32192

This shows that an estimate of about 2.3% was issued on May 7. If we don’t specify as_of, we get the most recent estimate available:

>>> covidcast.signal("doctor-visits", "smoothed_cli",
...                  start_day=date(2020, 5, 1), end_day=date(2020, 5, 1),
...                  geo_type="state", geo_values="pa")
   geo_value      issue  lag sample_size stderr time_value     value
0         pa 2020-07-04   64        None   None 2020-05-01  5.075015

Note the substantial change in the estimate, to over 5%, reflecting new data that became available after May 7 about visits occurring on May 1. This illustrates the importance of issue date tracking, particularly for forecasting tasks. To backtest a forecasting model on past data, it is important to use the data that would have been available at the time, not data that arrived much later.
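The selection that as_of performs can be expressed in plain pandas: for each time_value, keep the latest issue on or before the as-of date. The sketch below applies that logic to a small frame built from the Pennsylvania issue history shown on this page; the helper function is our own, not part of the package:

```python
import pandas as pd

# Issue history for time_value 2020-05-01 in Pennsylvania,
# using values reported elsewhere on this page.
history = pd.DataFrame({
    "time_value": pd.to_datetime(["2020-05-01"] * 3),
    "issue": pd.to_datetime(["2020-05-05", "2020-05-07", "2020-07-04"]),
    "value": [1.693061, 2.321920, 5.075015],
})

def as_of_view(df, as_of):
    """Keep, for each time_value, the latest issue on or before `as_of`."""
    known = df[df["issue"] <= as_of]
    return known.sort_values("issue").groupby("time_value").tail(1)
```

With as_of set to 2020-05-07, this yields the 2.32192 estimate; with a later date, the revised 5.075015 estimate.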

By using the issues argument, we can request all issues in a certain time period:

>>> covidcast.signal("doctor-visits", "smoothed_cli",
...                  start_day=date(2020, 5, 1), end_day=date(2020, 5, 1),
...                  geo_type="state", geo_values="pa",
...                  issues=(date(2020, 5, 1), date(2020, 5, 15)))
   geo_value      issue  lag sample_size stderr time_value     value
0         pa 2020-05-05    4        None   None 2020-05-01  1.693061
1         pa 2020-05-06    5        None   None 2020-05-01  2.524167
2         pa 2020-05-07    6        None   None 2020-05-01  2.321920
3         pa 2020-05-08    7        None   None 2020-05-01  2.897032
4         pa 2020-05-09    8        None   None 2020-05-01  2.956456
5         pa 2020-05-12   11        None   None 2020-05-01  3.190634
6         pa 2020-05-13   12        None   None 2020-05-01  3.220023
7         pa 2020-05-14   13        None   None 2020-05-01  3.231314
8         pa 2020-05-15   14        None   None 2020-05-01  3.239970

This estimate was clearly updated many times as new data for May 1st arrived. Note that these results include only data issued or updated between 2020-05-01 and 2020-05-15. If a value was first reported on 2020-04-15, and never updated, a query for issues between 2020-05-01 and 2020-05-15 will not include that value among its results.
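The window semantics can be illustrated with a simple filter on the issue column. The frame below is synthetic, with made-up values: one row was issued in April and never revised, one during May. Only the May issue falls inside the window:

```python
import pandas as pd

# Hypothetical issue records: one value issued in April and never
# revised, and one issued during May.
records = pd.DataFrame({
    "time_value": pd.to_datetime(["2020-04-10", "2020-05-01"]),
    "issue": pd.to_datetime(["2020-04-15", "2020-05-05"]),
    "value": [1.2, 1.7],
})

start, end = pd.Timestamp("2020-05-01"), pd.Timestamp("2020-05-15")
# The April row drops out: its issue date precedes the window, even
# though its value was still current during May.
in_window = records[(records["issue"] >= start) & (records["issue"] <= end)]
```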

Finally, we can use the lag argument to request only data reported with a certain lag. For example, requesting a lag of 7 days returns only issues made exactly 7 days after the corresponding time_value:

>>> covidcast.signal("doctor-visits", "smoothed_cli",
...                  start_day=date(2020, 5, 1), end_day=date(2020, 5, 7),
...                  geo_type="state", geo_values="pa", lag=7)
   geo_value      issue  lag sample_size stderr time_value     value
0         pa 2020-05-08    7        None   None 2020-05-01  2.897032
0         pa 2020-05-09    7        None   None 2020-05-02  2.802238
0         pa 2020-05-12    7        None   None 2020-05-05  3.483125
0         pa 2020-05-13    7        None   None 2020-05-06  2.968670
0         pa 2020-05-14    7        None   None 2020-05-07  2.400255

Note that although this query requested all values between 2020-05-01 and 2020-05-07, May 3rd and May 4th were not included in the result set. The query only includes a result for May 3rd if a value was issued on May 10th (a 7-day lag), but in fact the value was not updated on that day:

>>> covidcast.signal("doctor-visits", "smoothed_cli",
...                  start_day=date(2020, 5, 3), end_day=date(2020, 5, 3),
...                  geo_type="state", geo_values="pa",
...                  issues=(date(2020, 5, 9), date(2020, 5, 15)))
   geo_value      issue  lag sample_size stderr time_value     value
0         pa 2020-05-09    6        None   None 2020-05-03  2.749537
1         pa 2020-05-12    9        None   None 2020-05-03  2.989626
2         pa 2020-05-13   10        None   None 2020-05-03  3.006860
3         pa 2020-05-14   11        None   None 2020-05-03  2.970561
4         pa 2020-05-15   12        None   None 2020-05-03  3.038054
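The lag column is simply the number of days between issue and time_value, so the lag filter can be mimicked on a versioned frame. A sketch on made-up rows:

```python
import pandas as pd

# Hypothetical versioned rows for two time_values.
versions = pd.DataFrame({
    "time_value": pd.to_datetime(["2020-05-01", "2020-05-01", "2020-05-02"]),
    "issue": pd.to_datetime(["2020-05-07", "2020-05-08", "2020-05-09"]),
    "value": [2.3, 2.9, 2.8],
})

# lag = days between the issue date and the date being estimated.
versions["lag"] = (versions["issue"] - versions["time_value"]).dt.days
lag_7 = versions[versions["lag"] == 7]
```

Only the rows issued exactly 7 days after their time_value survive; a time_value with no issue at that exact lag disappears from the results, as seen above.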

Dealing with geographies

As seen above, the COVIDcast API identifies counties by their FIPS code and states by two-letter abbreviations. Metropolitan statistical areas are also identified by unique codes, called CBSA IDs. (Exact details and exceptions are given in the geographic coding documentation.) If you want to find a specific area by name, this package provides convenience functions:

>>> covidcast.name_to_cbsa(["Houston", "San Antonio"])
['26420', '41700']

We can use these functions to quickly query data for specific regions:

>>> counties = covidcast.name_to_fips(["Allegheny", "Los Angeles", "Miami-Dade"])
>>> df = covidcast.signal("doctor-visits", "smoothed_cli",
...                       start_day=date(2020, 5, 1), end_day=date(2020, 5, 1),
...                       geo_values=counties)
>>> df
  geo_value        signal time_value      issue  lag     value stderr sample_size geo_type    data_source
0     42003  smoothed_cli 2020-05-01 2020-07-04   64  1.336086   None        None   county  doctor-visits
0     06037  smoothed_cli 2020-05-01 2020-07-04   64  5.787655   None        None   county  doctor-visits
0     12086  smoothed_cli 2020-05-01 2020-07-04   64  6.405477   None        None   county  doctor-visits

We can also quickly convert back from the IDs returned by the API to human-readable names:

>>> covidcast.fips_to_name(df.geo_value)
['Allegheny County', 'Los Angeles County', 'Miami-Dade County']

Because the functions support regular expression matching, we can quickly find all regions meeting certain criteria. For example, the first two digits of a county’s five-digit FIPS code identify its state. We can find all counties in the state of Pennsylvania by querying for FIPS codes beginning with 42 and requesting all matches:

>>> pa_counties = covidcast.fips_to_name("^42.*", ties_method="all")
>>> pa_counties[0]
{'42000': ['Pennsylvania'], '42001': ['Adams County'], '42003': ['Allegheny County'], '42005': ['Armstrong County'], '42007': ['Beaver County'], '42009': ['Bedford County'], '42011': ['Berks County'], '42013': ['Blair County'], '42015': ['Bradford County'], '42017': ['Bucks County'], '42019': ['Butler County'], '42021': ['Cambria County'], '42023': ['Cameron County'], '42025': ['Carbon County'], '42027': ['Centre County'], '42029': ['Chester County'], '42031': ['Clarion County'], '42033': ['Clearfield County'], '42035': ['Clinton County'], '42037': ['Columbia County'], '42039': ['Crawford County'], '42041': ['Cumberland County'], '42043': ['Dauphin County'], '42045': ['Delaware County'], '42047': ['Elk County'], '42049': ['Erie County'], '42051': ['Fayette County'], '42053': ['Forest County'], '42055': ['Franklin County'], '42057': ['Fulton County'], '42059': ['Greene County'], '42061': ['Huntingdon County'], '42063': ['Indiana County'], '42065': ['Jefferson County'], '42067': ['Juniata County'], '42069': ['Lackawanna County'], '42071': ['Lancaster County'], '42073': ['Lawrence County'], '42075': ['Lebanon County'], '42077': ['Lehigh County'], '42079': ['Luzerne County'], '42081': ['Lycoming County'], '42083': ['McKean County'], '42085': ['Mercer County'], '42087': ['Mifflin County'], '42089': ['Monroe County'], '42091': ['Montgomery County'], '42093': ['Montour County'], '42095': ['Northampton County'], '42097': ['Northumberland County'], '42099': ['Perry County'], '42101': ['Philadelphia County'], '42103': ['Pike County'], '42105': ['Potter County'], '42107': ['Schuylkill County'], '42109': ['Snyder County'], '42111': ['Somerset County'], '42113': ['Sullivan County'], '42115': ['Susquehanna County'], '42117': ['Tioga County'], '42119': ['Union County'], '42121': ['Venango County'], '42123': ['Warren County'], '42125': ['Washington County'], '42127': ['Wayne County'], '42129': ['Westmoreland County'], '42131': ['Wyoming County'], '42133': ['York County']}
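The same prefix matching can be done directly with Python’s re module when you already have a list of FIPS codes in hand; a small sketch on a handful of codes taken from the examples above:

```python
import re

# A few FIPS codes from the examples on this page.
fips_codes = ["42003", "42101", "06037", "12086", "42000"]

# Codes beginning with 42 belong to Pennsylvania (42000 is the
# state-level code itself).
pa_pattern = re.compile(r"^42.*")
pa_codes = [code for code in fips_codes if pa_pattern.match(code)]
```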

See Working with geographic identifiers for details on each of these functions and their optional arguments.