{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Accessing versioned data\n", "\n", "The Delphi Epidata API stores not just each signal's estimate for a given\n", "location on a given day, but also *when* that estimate was made, and all updates\n", "to that estimate.\n", "\n", "For example, let's look at the [doctor visits\n", "signal](https://cmu-delphi.github.io/delphi-epidata/api/covidcast-signals/doctor-visits.html)\n", "from the [covidcast\n", "endpoint](https://cmu-delphi.github.io/delphi-epidata/api/covidcast.html), which\n", "estimates the percentage of outpatient doctor visits that are COVID-related.\n", "\n", "Consider a result row with `time_value = 2020-05-01` for `geo_values = \"pa\"`.\n", "This is an estimate for Pennsylvania on May 1, 2020. That estimate was *issued*\n", "on May 5, 2020 (which is recorded in the `issue` column), the delay coming from\n", "a combination of:\n", "\n", "- time taken by our data partner to collect the data\n", "- time taken by the Delphi Epidata API to ingest the data provided.\n", "\n", "Later, the estimate for May 1st could be updated, perhaps because additional\n", "visit data from May 1st arrived at our source and was reported to us. This\n", "constitutes a new *issue* of the data.\n", "\n", "## Data known \"as of\" a specific date\n", "\n", "By default, endpoint functions fetch the most recent issue available. This is\n", "the best option for users who simply want to graph the latest data or construct\n", "dashboards. 
But if we are interested in knowing *when* data was reported, we can\n", "request specific data versions using the `as_of`, `issues`, or `lag` arguments\n", "(note that these are mutually exclusive and that endpoints other than\n", "`pub_covidcast` may not support all three parameters, so please check the\n", "documentation for the specific endpoint).\n", "\n", "First, we can request the data *as it was available* on a specific date, using\n", "the `as_of` argument:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "nbsphinx": "hidden" }, "outputs": [], "source": [ "# Hidden cell (set in the metadata for this cell)\n", "import pandas as pd\n", "\n", "# Set common options and context\n", "pd.set_option(\"display.max_columns\", None)\n", "pd.set_option(\"display.max_rows\", 10)\n", "pd.set_option(\"display.width\", 1000)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from epidatpy import EpiDataContext, EpiRange\n", "\n", "epidata = EpiDataContext(use_cache=False)\n", "\n", "# Obtain the smoothed COVID-like illness (CLI) signal from the doctor visits\n", "# source for Pennsylvania on 2020-05-01, as it was known on 2020-05-07\n", "epidata.pub_covidcast(\n", "    data_source=\"doctor-visits\",\n", "    signals=\"smoothed_cli\",\n", "    time_type=\"day\",\n", "    time_values=\"2020-05-01\",\n", "    geo_type=\"state\",\n", "    geo_values=\"pa\",\n", "    as_of=\"2020-05-07\",\n", ").df()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This shows that an estimate of about 2.3% was issued on May 7. 
If we don't\n", "specify `as_of`, we get the most recent estimate available:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "epidata.pub_covidcast(\n", "    data_source=\"doctor-visits\",\n", "    signals=\"smoothed_cli\",\n", "    time_type=\"day\",\n", "    time_values=\"2020-05-01\",\n", "    geo_type=\"state\",\n", "    geo_values=\"pa\",\n", ").df()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note the substantial change in the estimate, from less than 3% to over 5%,\n", "reflecting new data that became available after May 7 about visits *occurring on*\n", "May 1. This illustrates the importance of issue date tracking, particularly\n", "for forecasting tasks. To backtest a forecasting model on past data, it is\n", "important to use the data that would have been available *at the time* the model\n", "was or would have been fit, not data that arrived much later.\n", "\n", "By plotting API results with different values of the `as_of` parameter, we can\n", "see how the indicator value changes over time as new observations become available:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import matplotlib.pyplot as plt\n", "\n", "plt.rcParams[\"figure.dpi\"] = 300\n", "\n", "results = []\n", "for as_of_date in [\"2020-05-07\", \"2020-05-14\", \"2020-05-21\", \"2020-05-28\"]:\n", "    apicall = epidata.pub_covidcast(\n", "        data_source=\"doctor-visits\",\n", "        signals=\"smoothed_adj_cli\",\n", "        time_type=\"day\",\n", "        time_values=EpiRange(\"2020-04-20\", \"2020-04-27\"),\n", "        geo_type=\"state\",\n", "        geo_values=\"pa\",\n", "        as_of=as_of_date,\n", "    )\n", "\n", "    results.append(apicall.df())\n", "\n", "final_df = pd.concat(results)\n", "final_df[\"issue\"] = final_df[\"issue\"].dt.date\n", "\n", "fig, ax = plt.subplots(figsize=(6, 5))\n", "ax.spines[\"right\"].set_visible(False)\n", "ax.spines[\"left\"].set_visible(False)\n", "ax.spines[\"top\"].set_visible(False)\n", 
"\n", "final_df.pivot_table(values=\"value\", index=\"time_value\", columns=\"issue\").plot(\n", "    xlabel=\"Date\", ylabel=\"CLI\", ax=ax, linewidth=1.5\n", ")\n", "\n", "plt.title(\"Smoothed CLI from Doctor Visits\", fontsize=16)\n", "plt.subplots_adjust(bottom=0.2)\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Multiple issues of observations\n", "\n", "By using the `issues` argument, we can request all issues in a certain time\n", "period:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "epidata.pub_covidcast(\n", "    data_source=\"doctor-visits\",\n", "    signals=\"smoothed_adj_cli\",\n", "    time_type=\"day\",\n", "    time_values=\"2020-05-01\",\n", "    geo_type=\"state\",\n", "    geo_values=\"pa\",\n", "    issues=EpiRange(\"2020-05-01\", \"2020-05-15\"),\n", ").df()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This estimate was clearly updated many times as new data for May 1st arrived.\n", "Note that these results include only data issued or updated between (inclusive)\n", "2020-05-01 and 2020-05-15. If a value was first reported on 2020-04-15, and\n", "never updated, a query for issues between 2020-05-01 and 2020-05-15 will not\n", "include that value among its results. This view of the data is useful for\n", "understanding the revision patterns in a signal and can help with nowcasting\n", "(i.e., the practice of auto-correcting real-time estimates).\n", "\n", "## Observations issued with a specific lag\n", "\n", "Finally, we can use the `lag` argument to request only data reported with a\n", "certain lag. 
For example, requesting a lag of 7 days fetches only data issued\n", "exactly 7 days after the corresponding `time_value`:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "epidata.pub_covidcast(\n", "    data_source=\"doctor-visits\",\n", "    signals=\"smoothed_adj_cli\",\n", "    time_type=\"day\",\n", "    time_values=EpiRange(\"2020-05-01\", \"2020-05-07\"),\n", "    geo_type=\"state\",\n", "    geo_values=\"pa\",\n", "    lag=7,\n", ").df()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that though this query requested all values between 2020-05-01 and\n", "2020-05-07, May 3rd and May 4th were *not* included in the result set. This is\n", "because the query will only include a result for May 3rd if a value was issued\n", "on May 10th (a 7-day lag), but in fact the value was not updated on that day:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "epidata.pub_covidcast(\n", "    data_source=\"doctor-visits\",\n", "    signals=\"smoothed_adj_cli\",\n", "    time_type=\"day\",\n", "    time_values=\"2020-05-03\",\n", "    geo_type=\"state\",\n", "    geo_values=\"pa\",\n", "    issues=EpiRange(\"2020-05-09\", \"2020-05-15\"),\n", ").df()" ] } ], "metadata": { "language_info": { "name": "python" } }, "nbformat": 4, "nbformat_minor": 2 }