{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Getting started\n", "\n", "The epidatpy package provides access to all the endpoints of the [Delphi Epidata\n", "API](https://cmu-delphi.github.io/delphi-epidata/), and can be used to make\n", "requests for specific signals on specific dates and in select geographic\n", "regions.\n", "\n", "## Basic usage\n", "\n", "Fetching data from the Delphi Epidata API is simple. Suppose we are\n", "interested in the [covidcast endpoint](https://cmu-delphi.github.io/delphi-epidata/api/covidcast.html),\n", "which provides access to a [wide range of data](https://cmu-delphi.github.io/delphi-epidata/api/covidcast_signals.html)\n", "on COVID-19. Reviewing the endpoint documentation, we see that we\n", "[need to specify](https://cmu-delphi.github.io/delphi-epidata/api/covidcast.html#constructing-api-queries)\n", "a data source name, a signal name, a geographic level, a time resolution, and\n", "the location and times of interest.\n", "\n", "The `pub_covidcast` function lets us access the `covidcast` endpoint. Here we\n", "demonstrate how to fetch the most up-to-date version of the confirmed cumulative COVID cases\n", "from the JHU CSSE data source at the national level." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "nbsphinx": "hidden" }, "outputs": [], "source": [ "# Hidden cell (set in the metadata for this cell)\n", "import pandas as pd\n", "\n", "# Set common options and context\n", "pd.set_option(\"display.max_columns\", None)\n", "pd.set_option(\"display.max_rows\", 10)\n", "pd.set_option(\"display.width\", 1000)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from epidatpy import CovidcastEpidata, EpiDataContext, EpiRange\n", "\n", "# Create the client object. 
Note that due to the arguments below all results\n", "# will be cached to your disk for 7 days, which helps avoid making repeated\n", "# downloads.\n", "epidata = EpiDataContext(use_cache=True, cache_max_age_days=7)\n", "\n", "# `pub_covidcast` returns an `EpiDataCall`, which is a not-yet-executed query\n", "# that can be inspected.\n", "apicall = epidata.pub_covidcast(\n", " data_source=\"jhu-csse\",\n", " signals=\"confirmed_cumulative_num\",\n", " geo_type=\"nation\",\n", " time_type=\"day\",\n", " geo_values=\"us\",\n", " time_values=EpiRange(20210405, 20210410),\n", ")\n", "print(apicall)\n", "# The query can be executed and converted to a DataFrame by using the `.df()`\n", "# method:\n", "apicall.df()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Create the pub_covidcast-specific client object. This lets you find what sources\n", "# and signals are available without leaving your REPL.\n", "covidcast = CovidcastEpidata(use_cache=True, cache_max_age_days=7)\n", "# Get a list of all the sources available in the pub_covidcast endpoint.\n", "print(covidcast.source_names())\n", "print(covidcast.signal_names(\"jhu-csse\"))\n", "# Obtain the same data as above with a different interface.\n", "covidcast[\"jhu-csse\", \"confirmed_cumulative_num\"].call(\n", " \"nation\",\n", " \"us\",\n", " EpiRange(20210405, 20210410),\n", ").df()\n", "# See the \"Finding data of interest\" notebook for more features of this interface." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Each row represents one observation in the US on one\n", "day. The geographical abbreviation is given in the `geo_value` column, the date in\n", "the `time_value` column. 
Here `value` is the requested signal -- in this\n", "case, the cumulative number of confirmed COVID-19 cases reported by JHU CSSE --\n", "and `stderr` is its standard error, when the source provides one.\n", "\n", "The Epidata API makes signals available at different geographic levels,\n", "depending on the endpoint. To request signals for all states instead of the\n", "entire US, we set the `geo_type` argument to `state` and pass `*` as the\n", "`geo_values` argument. The example below does this for the `smoothed_cli`\n", "signal from the `fb-survey` source, a smoothed estimate of the percentage of\n", "people with COVID-like illness based on symptom surveys. (Only some endpoints\n", "allow the use of `*` to access data at all locations. Check the help for a\n", "given endpoint to see if it supports `*`.)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "epidata.pub_covidcast(\n", " data_source=\"fb-survey\",\n", " signals=\"smoothed_cli\",\n", " geo_type=\"state\",\n", " time_type=\"day\",\n", " geo_values=\"*\",\n", " time_values=EpiRange(20210405, 20210410),\n", ").df()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Alternatively, we can fetch the full time series for a subset of states by\n", "listing the desired locations in the `geo_values` argument and using\n", "`*` in the `time_values` argument:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "epidata.pub_covidcast(\n", " data_source=\"fb-survey\",\n", " signals=\"smoothed_cli\",\n", " geo_type=\"state\",\n", " time_type=\"day\",\n", " geo_values=\"pa,ca,fl\",\n", " time_values=\"*\",\n", ").df()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Getting versioned data\n", "\n", "The Epidata API stores a historical record of all data, including corrections\n", "and updates, which is particularly useful for accurately backtesting\n", "forecasting models. 
To fetch versioned data, we can use the `as_of`\n", "argument:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "epidata.pub_covidcast(\n", " data_source=\"fb-survey\",\n", " signals=\"smoothed_cli\",\n", " geo_type=\"state\",\n", " time_type=\"day\",\n", " geo_values=\"pa\",\n", " time_values=EpiRange(20210405, 20210410),\n", " as_of=\"2021-06-01\",\n", ").df()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Plotting\n", "\n", "Because the output data is a standard Pandas DataFrame, we can easily plot\n", "it using any of the available Python libraries:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import matplotlib.pyplot as plt\n", "\n", "plt.rcParams[\"figure.dpi\"] = 300\n", "\n", "apicall = epidata.pub_covidcast(\n", " data_source=\"fb-survey\",\n", " signals=\"smoothed_cli\",\n", " geo_type=\"state\",\n", " geo_values=\"pa,ca,fl\",\n", " time_type=\"day\",\n", " time_values=EpiRange(20210405, 20210410),\n", ")\n", "\n", "fig, ax = plt.subplots(figsize=(6, 5))\n", "ax.spines[\"right\"].set_visible(False)\n", "ax.spines[\"left\"].set_visible(False)\n", "ax.spines[\"top\"].set_visible(False)\n", "\n", "(\n", " apicall.df()\n", " .pivot_table(values=\"value\", index=\"time_value\", columns=\"geo_value\")\n", " .plot(xlabel=\"Date\", ylabel=\"CLI\", ax=ax, linewidth=1.5)\n", ")\n", "\n", "plt.title(\"Smoothed CLI from Facebook Survey\", fontsize=16)\n", "plt.subplots_adjust(bottom=0.2)\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Finding locations of interest\n", "\n", "Most data is only available for the US. Select endpoints report other countries at the national and/or regional levels. 
Endpoint descriptions explicitly state when they cover non-US locations.\n", "\n", "For endpoints that report US data, see the\n", "[geographic coding documentation](https://cmu-delphi.github.io/delphi-epidata/api/covidcast_geography.html)\n", "for available geographic levels.\n", "\n", "## International data\n", "\n", "International data is available via\n", "\n", "- `pub_dengue_nowcast` (North and South America)\n", "- `pub_ecdc_ili` (Europe)\n", "- `pub_kcdc_ili` (Korea)\n", "- `pub_nidss_dengue` (Taiwan)\n", "- `pub_nidss_flu` (Taiwan)\n", "- `pub_paho_dengue` (North and South America)\n", "- `pvt_dengue_sensors` (North and South America)\n", "\n", "## Finding data sources and signals of interest\n", "\n", "Above we used data from [Delphi’s symptom surveys](https://delphi.cmu.edu/covid19/ctis/),\n", "but the Epidata API includes numerous data streams: medical claims data, cases\n", "and deaths, mobility, and many others. This can make it a challenge to find\n", "the data stream that you are most interested in.\n", "\n", "The Epidata documentation lists all the data sources and signals available\n", "through the API for [COVID-19](https://cmu-delphi.github.io/delphi-epidata/api/covidcast_signals.html)\n", "and for [other diseases](https://cmu-delphi.github.io/delphi-epidata/api/README.html#source-specific-parameters).\n", "\n", "## Epiweeks and dates\n", "\n", "Epiweeks are formatted as YYYYWW and dates as YYYYMMDD.\n", "\n", "Epiweeks follow the U.S. CDC definition: weeks start on Sunday, and the first\n", "epiweek of each year is the week containing January 4th. See [this\n", "page](https://www.cmmcp.org/mosquito-surveillance-data/pages/epi-week-calendars-2008-2021)\n", "for a less terse explanation.\n", "\n", "When specifying the `time_values` argument, you can use individual values,\n", "comma-separated lists, or a hyphenated range of values to specify single or\n", "several dates (or epiweeks). 
An `EpiRange` object can also be used to construct\n", "a range of epiweeks or dates. Examples include:\n", "\n", "- `param = 201530` (A single epiweek)\n", "- `param = '201401,201501,201601'` (Several epiweeks)\n", "- `param = '200501-200552'` (A range of epiweeks)\n", "- `param = '201440,201501-201510'` (Several epiweeks, including a range)\n", "- `param = EpiRange(20070101, 20071231)` (A range of dates)\n" ] } ], "metadata": { "language_info": { "name": "python" } }, "nbformat": 4, "nbformat_minor": 2 }